Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reliability challenges that Retrieval-Augmented Generation (RAG) systems face when answering questions about AI governance policies, where dense legal language and overlapping, evolving regulations complicate accurate responses. Building on the AGORA corpus, the authors develop a domain-adapted RAG system that pairs a ColBERT retriever fine-tuned with contrastive learning with a generator aligned to human preferences via Direct Preference Optimization (DPO), trained on synthetic queries and pairwise preference ratings. Experiments show that while domain-specific fine-tuning improves retrieval performance, it does not consistently improve end-to-end answer relevance or faithfulness. Notably, stronger retrieval can paradoxically produce more confident hallucinations when critical documents are missed. This challenges the common assumption that better retrieval inherently yields better answers and exposes a key gap in the synergy between retrieval and generation in RAG systems.
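The summary names ColBERT as the retriever but does not spell out its scoring rule. ColBERT's late-interaction "MaxSim" score sums, for each query token embedding, the maximum similarity to any document token embedding. A minimal sketch with toy, hand-picked 2-D vectors (the real system uses fine-tuned transformer embeddings; these values are illustrative only):

```python
# Sketch of ColBERT's late-interaction "MaxSim" scoring, assuming
# unit-scale token embeddings represented as plain lists of floats.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Toy 2-D token embeddings (hypothetical values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # aligns well with both query tokens
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # aligns weakly with each query token

print(maxsim_score(query, doc_a) > maxsim_score(query, doc_b))  # True
```

Because each query token matches its best document token independently, MaxSim rewards documents that cover all parts of a query, which is why contrastive fine-tuning of the token embeddings can sharpen retrieval on domain-specific legal phrasing.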

📝 Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
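The abstract aligns the generator with Direct Preference Optimization. The paper's training code is not reproduced here; the following is a minimal sketch of the standard DPO objective on a single (chosen, rejected) pair, using hypothetical log-probability values:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen (w) and
    rejected (l) answers under the policy and the frozen reference model.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # -log sigmoid(margin): small when the policy favors the chosen answer
    # more strongly (relative to the reference) than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Pair where the policy shifts mass toward the chosen answer relative to
# the reference -> loss drops below the uninformed log(2) baseline.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0) < math.log(2))  # True
```

When the policy and reference agree exactly, the margin is zero and the loss sits at log 2; training pushes the implicit reward margin positive on the collected pairwise preference ratings.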
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
AI Policy QA
Hallucination
Regulatory Corpus
Answer Faithfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation (RAG)
ColBERT
Direct Preference Optimization (DPO)
AI Policy QA
Hallucination Analysis
Saahil Mathur
Department of Computer Science, Purdue University, West Lafayette, IN 47907
Ryan David Rittner
Department of Computer Science, Purdue University, West Lafayette, IN 47907
Vedant Ajit Thakur
Department of Political Science, Purdue University, West Lafayette, IN 47907
Daniel Stuart Schiff
Department of Political Science, Purdue University, West Lafayette, IN 47907
Tunazzina Islam
Visiting Assistant Professor CS @Purdue University, Ph.D. in CS @Purdue University
Natural Language Processing · Computational Social Science · Artificial Intelligence