🤖 AI Summary
Existing open-domain RAG benchmarks struggle to accurately evaluate real-world performance in specialized domains such as defense, because of pretraining data contamination and weak grounding constraints. To address this, this work introduces DoRA, the first controllable, synthetically generated RAG evaluation benchmark tailored for professional domains. DoRA comprises 6.5K curated instances spanning five question types (find, explain, summarize, generate, provide), pairing intent-conditioned question generation with auditable evidence passages. It further imposes stringent grounding requirements and supports contamination-aware regression testing. Supervised fine-tuning of Llama3.1-8B-Instruct on DoRA yields up to a 26% improvement in question-answering success rate and a 47% reduction in hallucination rate, substantially enhancing the faithfulness and reliability of RAG systems in professional settings.
📝 Abstract
Open-domain retrieval-augmented generation (RAG) benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned question answering (QA) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose language models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to a 26% improvement in QA task success and a 47% reduction in hallucination rate as measured by RAG faithfulness scores. These results support the use of DoRA for contamination-aware regression testing under domain shift.