🤖 AI Summary
This work addresses a limitation of existing Retrieval-Augmented Generation (RAG) evaluation methods, which predominantly focus on final-answer quality and lack fine-grained assessment of how individual evidence items contribute during reasoning. The authors propose CUE-R, a framework that uses lightweight intervention mechanisms (REMOVE, REPLACE, and DUPLICATE) to quantify the operational utility of evidence along four dimensions: correctness, proxy faithfulness, confidence error, and behavioral trajectory shift. This approach moves beyond answer-centric evaluation and reveals non-additive interaction effects among multi-hop evidence items. Experiments on HotpotQA and 2WikiMultihopQA show that REMOVE and REPLACE significantly degrade performance and induce behavioral shifts, while DUPLICATE, though answer-redundant, is not behaviorally neutral; moreover, jointly removing both multi-hop supports degrades performance far more than either single removal.
📝 Abstract
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly addresses the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval rather than from perturbation alone. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
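The three intervention operators described in the abstract can be pictured as simple list edits over a retrieved evidence set. The sketch below is a minimal illustration under assumed names and signatures; it is not the authors' implementation, and the paper's actual operators may act on richer evidence representations than plain strings:

```python
# Illustrative sketch of CUE-R-style evidence interventions.
# Each operator produces a perturbed evidence list; in the framework,
# the perturbed context would be re-fed to the model and changes in
# correctness, grounding, confidence, and trace divergence measured.

def remove(evidence: list[str], i: int) -> list[str]:
    """Drop the i-th evidence item entirely."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: list[str], i: int, distractor: str) -> list[str]:
    """Swap the i-th item for a distractor passage."""
    return evidence[:i] + [distractor] + evidence[i + 1:]

def duplicate(evidence: list[str], i: int) -> list[str]:
    """Repeat the i-th item; content stays answer-redundant."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]
```

For example, on `["p1", "p2", "p3"]`, `remove(..., 1)` yields `["p1", "p3"]` while `duplicate(..., 1)` yields `["p1", "p2", "p2", "p3"]`; the paper's finding is that these structurally similar edits have very different downstream effects.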