Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
AI Summary
For a RAG system targeting AI policy, the study finds that improvements in retrieval performance do not guarantee better question-answering quality and can even lead to more confident hallucinations.
Key Contributions
- Evaluated the effectiveness of RAG for AI policy question answering
- Found that improving retrieval quality does not necessarily improve answer quality
- Revealed limitations of RAG systems on complex policy documents
Methodology
Uses the AGORA corpus, combining a ColBERT-based retriever with a DPO-aligned generator; the system is adapted to the policy domain via contrastive learning on synthetic queries and human preference data.
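The ColBERT retriever scores a query against a document by "late interaction": each query token embedding takes the maximum dot product over all document token embeddings, and these maxima are summed. A minimal sketch of that MaxSim score, plus one common instantiation of the contrastive fine-tuning objective (an InfoNCE-style loss over a positive document and sampled negatives — the paper does not specify its exact loss, so this is an illustrative assumption):

```python
import math

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the
    max dot product with any document token embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def contrastive_loss(query_vecs, pos_doc, neg_docs, temperature=1.0):
    """InfoNCE-style contrastive loss: the positive document competes
    against negative documents under the MaxSim score. (Illustrative;
    the paper's exact objective and negatives are not specified.)"""
    logits = [maxsim_score(query_vecs, pos_doc) / temperature]
    logits += [maxsim_score(query_vecs, d) / temperature for d in neg_docs]
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of ranking the positive first.
    return -(logits[0] - log_z)
```

Minimizing this loss pushes the retriever to score the relevant (positive) document above the sampled negatives for each synthetic query.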
Original Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
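The DPO alignment step mentioned in the abstract optimizes the generator directly on pairwise preferences, without a separate reward model. A minimal sketch of the standard DPO loss for one preference pair (function and variable names are illustrative; inputs are summed token log-probabilities under the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair.

    beta scales the implicit reward; the loss falls as the policy
    widens the log-probability gap in favor of the chosen answer
    relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the implicit reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model exactly, the margin is zero and the loss is log 2; preferring the chosen answer more than the reference does drives the loss below that baseline.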