🤖 AI Summary
This work addresses a gap in current retrieval-augmented generation (RAG) evaluation benchmarks: they overlook the high redundancy of real-world documents in domains such as finance, law, and patents, which distorts performance assessments. To bridge this gap, the authors propose RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing redundancy-aware benchmarks. RARE tracks redundancy by decomposing documents into atomic facts, uses large language models with CRRF, which scores each quality criterion separately and fuses the results by rank, to generate high-quality question-answer pairs, and introduces the PerfRecall@k metric. Applying RARE to Finance, Legal, and Patent corpora yields the RedQA benchmark. On RedQA, even a strong retriever baseline drops sharply: PerfRecall@10 falls from 66.4% on 4-hop General-Wiki to 5.0–27.9% at 4-hop depth, exposing robustness gaps that standard benchmarks fail to capture.
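The summary names PerfRecall@k without defining it. As a minimal sketch, assuming it means "the top-k documents jointly cover every atomic fact the question needs" (an assumption, not the paper's stated definition), a redundancy-aware check could look like the following. The function name, argument names, and the fact-to-document mapping are all hypothetical.

```python
def perf_recall_at_k(retrieved, required_facts, facts_by_doc, k):
    """Return 1.0 if the top-k documents jointly cover every required fact, else 0.0.

    ASSUMPTION: this is one plausible reading of PerfRecall@k, not the
    paper's confirmed definition.
    retrieved      -- ranked list of document ids, best first
    required_facts -- set of atomic fact ids needed to answer the question
    facts_by_doc   -- dict mapping document id -> set of atomic fact ids it states
    """
    covered = set()
    for doc_id in retrieved[:k]:
        covered |= facts_by_doc.get(doc_id, set())
    return 1.0 if set(required_facts) <= covered else 0.0


# Two documents (d1, d2) redundantly state fact f1; either one is sufficient evidence.
facts_by_doc = {"d1": {"f1"}, "d2": {"f1"}, "d3": {"f2"}}
retrieved = ["d2", "d3", "d1"]
score_at_2 = perf_recall_at_k(retrieved, {"f1", "f2"}, facts_by_doc, k=2)
score_at_1 = perf_recall_at_k(retrieved, {"f1", "f2"}, facts_by_doc, k=1)
```

Under this reading, a retriever that returns `d2` instead of `d1` is not penalized, because coverage is judged per fact rather than per labeled document.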
📝 Abstract
Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. Conversely, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. Generating RAG benchmark data usually requires satisfying multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0–27.9% PerfRecall@10 on RedQA at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.
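The abstract says CRRF scores criteria separately and fuses decisions by rank, but gives no formula. The sketch below illustrates that general idea using standard reciprocal rank fusion as a stand-in; the fusion rule, the constant `k=60`, the function name, and the criterion names are all assumptions for illustration, not the authors' method.

```python
from collections import defaultdict


def fuse_by_rank(scores_by_criterion, k=60):
    """Fuse per-criterion scores into one ranking via reciprocal rank fusion.

    ASSUMPTION: reciprocal rank fusion is used here as a generic stand-in;
    the abstract does not specify how CRRF combines ranks.
    scores_by_criterion -- {criterion: {candidate_id: score}}, higher is better
    Returns candidate ids sorted from best to worst fused score.
    """
    fused = defaultdict(float)
    for scores in scores_by_criterion.values():
        # Rank candidates within this criterion only (rank 1 = best).
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, candidate in enumerate(ranked, start=1):
            fused[candidate] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical criteria and scores for three generated question candidates.
scores = {
    "answerability": {"q1": 0.9, "q2": 0.8, "q3": 0.2},
    "multi_hop": {"q1": 0.3, "q2": 0.9, "q3": 0.5},
    "non_trivial": {"q1": 0.4, "q2": 0.7, "q3": 0.6},
}
ranking = fuse_by_rank(scores)
```

Fusing ranks rather than raw scores means criteria measured on different scales cannot dominate one another, so a candidate such as `q2` that ranks near the top on every criterion wins over one that excels on a single criterion.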