Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing RAG evaluation benchmarks inadequately reflect real-world complexity: they fail to assess models' ability to identify unanswerable questions, are vulnerable to "cheating" on multi-hop queries (e.g., via direct keyword matching), and often feature overly simplistic questions. Method: We propose the first automated, difficulty-controlled pipeline for generating queries that are *unanswerable* within a given knowledge base, *semantically coherent*, *genuinely multi-hop*, and *resistant to cheating*. Our approach enforces three constraints: (1) explicit specification of the multi-hop reasoning path, (2) semantic-coherence control during question generation, and (3) retrieval-path validation to rule out shortcut solutions. Contribution/Results: Applied to two popular RAG datasets, our benchmark reduces cheatability scores by up to 81.0%, sharply exposing deficiencies in current systems' multi-step reasoning and refusal capabilities, and establishes a high-fidelity RAG evaluation framework grounded in realistic, non-cheatable query construction.
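The three constraints above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the `Hop` structure, the `is_cheatable` shortcut check (a single document mentioning every bridge entity), and the `render` callback are all hypothetical stand-ins for the pipeline's path specification, validation, and question-generation steps.

```python
from dataclasses import dataclass

@dataclass
class Hop:
    entity: str   # bridge entity this reasoning step resolves
    doc_id: str   # document assumed to support the step

def is_cheatable(hops, corpus):
    """Hypothetical shortcut check: a query is 'cheatable' if any single
    document mentions every bridge entity, so keyword matching suffices."""
    entities = [h.entity for h in hops]
    return any(all(e in text for e in entities) for text in corpus.values())

def build_crumq(hops, corpus, render):
    # Constraint 1: the multi-hop reasoning path (hops) is given explicitly.
    # Constraint 2: `render` turns the path into a coherent natural question.
    # Constraint 3: reject the candidate if a retrieval shortcut exists.
    if is_cheatable(hops, corpus):
        return None
    return render(hops)

corpus = {
    "d1": "Marie Curie won the Nobel Prize in Physics.",
    "d2": "That Physics prize is awarded annually in Stockholm.",
}
hops = [Hop("Nobel Prize in Physics", "d1"), Hop("Stockholm", "d2")]
question = build_crumq(
    hops, corpus,
    lambda hs: "In which city was the prize won by Marie Curie awarded?",
)
# The two bridge entities never co-occur in one document, so the query survives.
```

If both entities appeared in a single document, `build_crumq` would return `None`, discarding the candidate as cheatable.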

📝 Abstract
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un**c**heatable, **r**ealistic, **u**nanswerable, and **m**ulti-hop **q**uerie**s** (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.
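The "cheatability score" and the reported 81.0% reduction can be made concrete with a toy sketch. The exact metric definition below is an assumption: it treats the score as the share of multi-hop queries whose key facts all co-occur in a single document, which would permit disconnected-reasoning shortcuts; `has_shortcut` and the toy corpus are illustrative only.

```python
def cheatability_score(queries, corpus):
    """Hypothetical metric: fraction of multi-hop queries whose required
    facts all appear in one document (a disconnected-reasoning shortcut)."""
    def has_shortcut(facts):
        return any(all(f in doc for f in facts) for doc in corpus)
    return sum(has_shortcut(q) for q in queries) / len(queries)

corpus = ["Alice was born in Bern. Bern is in Switzerland.", "Dana wrote Elegy."]
# Baseline-style queries: all facts sit in one document, so each is cheatable.
baseline = [["Alice", "Bern", "Switzerland"], ["Dana", "Elegy"]]
# CRUMQ-style queries: facts are split across documents, blocking shortcuts.
crumqs = [["Alice", "Elegy"], ["Bern", "Dana"]]

before = cheatability_score(baseline, corpus)  # 1.0 in this toy setup
after = cheatability_score(crumqs, corpus)     # 0.0 in this toy setup
reduction = (before - after) / before          # relative reduction in the score
```

In this toy setup the reduction is total; the paper reports up to 81.0% on real datasets.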
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG systems on unanswerable and multi-hop queries
Addressing the limited realism of existing RAG benchmarks
Building an automated pipeline for challenging query generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline generates uncheatable multi-hop queries
Difficulty-controlled creation adaptable to any corpus
Enhances benchmark realism for RAG system evaluation