Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing RAG evaluation benchmarks inadequately reflect real-world complexity: they fail to assess models' ability to identify unanswerable questions, are vulnerable to "cheating" on multi-hop queries (e.g., via direct keyword matching), and often feature overly simplistic questions. Method: We propose the first automated, difficulty-controlled pipeline for generating queries that are *unanswerable* within a given knowledge base, *semantically coherent*, *genuinely multi-hop*, and *resistant to cheating*. Our approach enforces three constraints: (1) explicit specification of the multi-hop reasoning path, (2) semantic-coherence control during question generation, and (3) retrieval-path validation to rule out shortcut solutions. Contribution/Results: Applied to two popular RAG datasets, our benchmark reduces cheatability scores by up to 81.0%, sharply exposing deficiencies in current systems' multi-step reasoning and refusal capabilities, and establishes a high-fidelity RAG evaluation framework grounded in realistic, non-cheatable query construction.
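The three constraints above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the `Hop` structure, the `is_cheatable` shortcut check (a single document mentioning every bridge entity), and the `render` callback are all hypothetical stand-ins for the pipeline's path specification, validation, and question-generation steps.

```python
from dataclasses import dataclass

@dataclass
class Hop:
    entity: str   # bridge entity this reasoning step resolves
    doc_id: str   # document assumed to support the step

def is_cheatable(hops, corpus):
    """Hypothetical shortcut check: a query is 'cheatable' if any single
    document mentions every bridge entity, so keyword matching suffices."""
    entities = [h.entity for h in hops]
    return any(all(e in text for e in entities) for text in corpus.values())

def build_crumq(hops, corpus, render):
    # Constraint 1: the multi-hop reasoning path (hops) is given explicitly.
    # Constraint 2: `render` turns the path into a coherent natural question.
    # Constraint 3: reject the candidate if a retrieval shortcut exists.
    if is_cheatable(hops, corpus):
        return None
    return render(hops)

corpus = {
    "d1": "Marie Curie won the Nobel Prize in Physics.",
    "d2": "That Physics prize is awarded annually in Stockholm.",
}
hops = [Hop("Nobel Prize in Physics", "d1"), Hop("Stockholm", "d2")]
question = build_crumq(
    hops, corpus,
    lambda hs: "In which city was the prize won by Marie Curie awarded?",
)
# The two bridge entities never co-occur in one document, so the query survives.
```

If both entities appeared in a single document, `build_crumq` would return `None`, discarding the candidate as cheatable.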

📝 Abstract
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un**c**heatable, **r**ealistic, **u**nanswerable, and **m**ulti-hop **q**uerie**s** (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.
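The "cheatability score" and the reported 81.0% reduction can be made concrete with a toy sketch. The exact metric definition below is an assumption: it treats the score as the share of multi-hop queries whose key facts all co-occur in a single document, which would permit disconnected-reasoning shortcuts; `has_shortcut` and the toy corpus are illustrative only.

```python
def cheatability_score(queries, corpus):
    """Hypothetical metric: fraction of multi-hop queries whose required
    facts all appear in one document (a disconnected-reasoning shortcut)."""
    def has_shortcut(facts):
        return any(all(f in doc for f in facts) for doc in corpus)
    return sum(has_shortcut(q) for q in queries) / len(queries)

corpus = ["Alice was born in Bern. Bern is in Switzerland.", "Dana wrote Elegy."]
# Baseline-style queries: all facts sit in one document, so each is cheatable.
baseline = [["Alice", "Bern", "Switzerland"], ["Dana", "Elegy"]]
# CRUMQ-style queries: facts are split across documents, blocking shortcuts.
crumqs = [["Alice", "Elegy"], ["Bern", "Dana"]]

before = cheatability_score(baseline, corpus)  # 1.0 in this toy setup
after = cheatability_score(crumqs, corpus)     # 0.0 in this toy setup
reduction = (before - after) / before          # relative reduction in the score
```

In this toy setup the reduction is total; the paper reports up to 81.0% on real datasets.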
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG systems on unanswerable and multi-hop queries
Addressing the limited realism of existing RAG benchmarks
Building an automated pipeline for challenging query generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline generates uncheatable multi-hop queries
Difficulty-controlled creation adaptable to any corpus
Enhances benchmark realism for RAG system evaluation