Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation methods struggle to diagnose a failure mode of multi-source retrieval-augmented generation (RAG) systems: the same question receiving different answers depending on which source is retrieved. This work proposes an evaluation framework centered on inter-source relationships and audits source-dependence in a medical setting (transplant patient education) with three artefacts: TransplantQA, a benchmark of real patient questions; HERO-QA, a hierarchical retrieval strategy; and a structured-output judge that uses a five-label taxonomy. The study finds that, with better retrieval, disagreement between sources is far more prevalent than prior estimates suggested. The authors describe the framework as domain-agnostic and transferable to legal and educational RAG.

📝 Abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
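To make the shift in the unit of evaluation concrete, here is a minimal Python sketch of what a source-dependence audit could look like: one answer per source for the same question, a judge applied to every pair of sources, and a prevalence figure over questions. It is not the paper's implementation. The function names, the `Judge` signature, the source names, and the two placeholder labels `"agree"` and `"conflict"` are all assumptions made for illustration; the actual 5-label taxonomy, HERO-QA retrieval, and the structured-output judge are not specified in this summary.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical judge signature: (question, answer_a, answer_b) -> relationship label.
# The paper uses a structured-output judge with a 5-label taxonomy; the labels
# themselves are not given here, so the judge is left as a pluggable callable.
Judge = Callable[[str, str, str], str]
PairLabels = Dict[Tuple[str, str], str]


def audit_question(question: str, answers_by_source: Dict[str, str], judge: Judge) -> PairLabels:
    """Label every unordered pair of per-source answers to one question."""
    return {
        (a, b): judge(question, answers_by_source[a], answers_by_source[b])
        for a, b in combinations(sorted(answers_by_source), 2)
    }


def disagreement_prevalence(audits: Iterable[PairLabels], disagreement_labels: Set[str]) -> float:
    """Fraction of questions with at least one source pair labelled as disagreement."""
    audits = list(audits)
    if not audits:
        return 0.0
    flagged = sum(
        any(label in disagreement_labels for label in pair_labels.values())
        for pair_labels in audits
    )
    return flagged / len(audits)


def toy_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder judge (assumption): exact string match stands in for an LLM judge."""
    return "agree" if answer_a.strip().lower() == answer_b.strip().lower() else "conflict"


# Toy data: source names and answers are invented for the example.
q1 = audit_question(
    "Can I eat grapefruit after my transplant?",
    {"handbook_a": "No", "handbook_b": "No", "handbook_c": "Yes"},
    toy_judge,
)
q2 = audit_question(
    "Should I take my medication at the same time every day?",
    {"handbook_a": "Yes", "handbook_b": "Yes"},
    toy_judge,
)
prevalence = disagreement_prevalence([q1, q2], {"conflict"})
```

The point of the design is that nothing here compares an answer to a gold reference: the score depends only on how answers grounded in different sources relate to each other. Prevalence is counted per question rather than per pair, which is one of several reasonable choices.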

Problem

Research questions and friction points this paper is trying to address.

source-dependence

retrieval-augmented generation

multi-source RAG

answer disagreement

NLP evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

source-dependence

multi-source RAG

retrieval-augmented generation