🤖 AI Summary
This work addresses a limitation of existing Retrieval-Augmented Generation (RAG) evaluation methods, which predominantly focus on final-answer quality and lack fine-grained assessment of how individual evidence items contribute during reasoning. The authors propose CUE-R, a framework that uses lightweight intervention mechanisms (REMOVE, REPLACE, and DUPLICATE) to quantify the operational utility of evidence along four dimensions: correctness, proxy faithfulness, confidence error, and behavioral trajectory shift. This approach moves beyond answer-centric evaluation and reveals non-additive interaction effects among multi-hop evidence items. Experiments on HotpotQA and 2WikiMultihopQA show that REMOVE and REPLACE significantly degrade performance and induce behavioral shifts, while DUPLICATE, though answer-redundant, is not behaviorally neutral; moreover, jointly removing both multi-hop supports degrades performance far more than either single removal.
📝 Abstract
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly addresses the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval rather than from perturbation alone. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
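The three intervention operators described in the abstract can be pictured as simple list edits over a retrieved evidence set. The sketch below is a minimal illustration under assumed names and signatures; it is not the authors' implementation, and the paper's actual operators may act on richer evidence representations than plain strings:

```python
# Illustrative sketch of CUE-R-style evidence interventions.
# Each operator produces a perturbed evidence list; in the framework,
# the perturbed context would be re-fed to the model and changes in
# correctness, grounding, confidence, and trace divergence measured.

def remove(evidence: list[str], i: int) -> list[str]:
    """Drop the i-th evidence item entirely."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: list[str], i: int, distractor: str) -> list[str]:
    """Swap the i-th item for a distractor passage."""
    return evidence[:i] + [distractor] + evidence[i + 1:]

def duplicate(evidence: list[str], i: int) -> list[str]:
    """Repeat the i-th item; content stays answer-redundant."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]
```

For example, on `["p1", "p2", "p3"]`, `remove(..., 1)` yields `["p1", "p3"]` while `duplicate(..., 1)` yields `["p1", "p2", "p2", "p3"]`; the paper's finding is that these structurally similar edits have very different downstream effects.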