Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Existing open-domain RAG benchmarks struggle to accurately evaluate real-world performance in specialized domains—such as defense—due to pretraining data contamination and weak grounding constraints. To address this, this work introduces DoRA, the first controllable, synthetically generated RAG evaluation benchmark tailored for professional domains. DoRA constructs a high-quality test set of 6.5K human-verified samples spanning five task categories, leveraging intent-conditioned question generation aligned with auditable evidence passages. It further incorporates stringent grounding requirements and a contamination-aware regression testing mechanism. Supervised fine-tuning of Llama3.1-8B-Instruct on DoRA yields up to a 26% improvement in question-answering success rate and a 47% reduction in hallucination rate, substantially enhancing the faithfulness and reliability of RAG systems in professional settings.

Technology Category

Application Category

📝 Abstract

Open-domain RAG benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned QA (question answering) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose Language Models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to 26% improvement in QA task success, while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.

Problem

Research questions and friction points this paper is trying to address.

RAG

benchmarking

domain shift

hallucination

attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-oriented RAG

Synthetic Benchmarking

Attribution-aware QA