A Reality Check on Context Utilisation for Retrieval-Augmented Generation

📅 2024-12-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing RAG evaluation relies heavily on synthetic contexts, leading to severe overestimation (by over 37%) of LMs' ability to leverage *real* retrieval outputs, while neglecting critical real-world factors such as context provenance and reliability. Method: the authors introduce DRUID, the first human-annotated, fact-checking-oriented dataset of *real* retrieval contexts, covering prevalent challenges including unreliability, insufficiency, and poor comprehensibility, and propose ACU (Attention-based Context Utilisation), a novel metric quantifying how effectively LMs attend to and exploit contextual information. Contribution/Results: empirical analysis reveals that context provenance alone explains 5.2× more variance in ACU than any other single attribute, providing the first systematic evidence of fundamental distributional and mechanistic disparities between real and synthetic contexts. This work establishes a reality-aligned benchmark for rigorous RAG evaluation.

📝 Abstract
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world-aligned context utilisation studies to represent and improve performance in real-world RAG settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluates real-world context complexity in retrieval-augmented generation (RAG).
Highlights limitations of synthetic datasets in representing realistic retrieval scenarios.
Identifies gaps in understanding context-source properties affecting RAG performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the DRUID dataset of real-world queries and contexts annotated for stance.
Compares DRUID to synthetic datasets (CounterFact, ConflictQA).
Proposes the ACU score for measuring context utilisation.
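The exact ACU computation is not spelled out on this page, but the general idea of an attention-based context-utilisation score can be sketched as the share of a model's attention mass that falls on retrieved-context tokens when it generates an answer. The function and toy numbers below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of an attention-based context-utilisation score.
# The paper's exact ACU definition is not given here; this only
# illustrates the general idea: the fraction of attention mass that
# lands on retrieved-context tokens, averaged over generated tokens.

def context_utilisation(attention_rows, context_mask):
    """attention_rows: one attention distribution per generated token,
    each summing to 1 over all input positions.
    context_mask: True where an input position belongs to the
    retrieved context (as opposed to the query/prompt)."""
    scores = []
    for row in attention_rows:
        # Attention mass falling on context positions for this token.
        mass_on_context = sum(a for a, is_ctx in zip(row, context_mask) if is_ctx)
        scores.append(mass_on_context)
    # Average over all generated tokens.
    return sum(scores) / len(scores)

# Toy example: 4 input positions, the last two are retrieved context.
mask = [False, False, True, True]
attn = [
    [0.1, 0.1, 0.4, 0.4],  # token 1 attends mostly to the context
    [0.3, 0.3, 0.2, 0.2],  # token 2 attends mostly to the query
]
print(round(context_utilisation(attn, mask), 6))  # 0.6
```

In practice such a score would be read off a model's per-layer attention tensors; how layers and heads are aggregated is a design choice this sketch deliberately leaves out.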