When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how the reliability of biomedical retrieval-augmented large language models degrades when retrieved evidence is conflicting, incorrect, or reordered. Using HealthContradict under five controlled evidence conditions, the authors evaluate six open-weight models and find that reversing the order of the same two conflicting documents lowers accuracy for every model and flips 11.4%–25.2% of predictions. To support abstention in these cases, they evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, the score improves selective accuracy over a confidence-only baseline, with mean gains of 7.2–33.4 points in the incorrect-only (IC) condition and 3.6–14.4 points in the incorrect-first conflicting (ICC) condition across coverage levels of 75%, 50%, and 25%.
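The flip percentage reported above can be read as the share of questions whose predicted answer changes when the same two documents are presented in the opposite order. The sketch below is one plausible way to compute it; the function name `flip_rate`, the answer labels, and the toy predictions are illustrative assumptions, not the paper's code or data.

```python
from typing import Sequence


def flip_rate(preds_order_a: Sequence[str], preds_order_b: Sequence[str]) -> float:
    """Fraction of questions whose prediction differs between two document orders.

    Both sequences must be aligned per question: element i of each list is the
    model's answer to question i, with only the evidence order changed.
    """
    if len(preds_order_a) != len(preds_order_b):
        raise ValueError("prediction lists must be aligned per question")
    if not preds_order_a:
        raise ValueError("need at least one question")
    flips = sum(a != b for a, b in zip(preds_order_a, preds_order_b))
    return flips / len(preds_order_a)


# Toy predictions (made up for illustration): correct-first vs. incorrect-first order.
correct_first = ["yes", "no", "yes", "yes"]
incorrect_first = ["yes", "yes", "yes", "no"]

rate = flip_rate(correct_first, incorrect_first)
print(f"flip rate: {rate:.1%}")
```

Note that a flip counts any change of answer, so it can be nonzero even when accuracy is identical in the two orders.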

📝 Abstract

Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%–25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2–33.4 points in incorrect-only ("IC") and 3.6–14.4 points in incorrect-first conflicting ("ICC") conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and a robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
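Selective accuracy at a given coverage is commonly computed by ranking predictions by a keep-score, answering only the top fraction, and measuring accuracy on that subset. The sketch below illustrates the idea with a hypothetical combined score. The abstract does not state how confidence and the conflict detector are combined, so the weighted average in `abstention_score`, the `weight` parameter, and all toy numbers are assumptions for illustration only.

```python
from typing import Sequence


def abstention_score(confidence: float, conflict_prob: float, weight: float = 0.5) -> float:
    """Hypothetical keep-score: high confidence and low detected conflict rank first.

    ASSUMPTION: a simple weighted average; the paper's actual combination
    rule is not given in the abstract.
    """
    return (1.0 - weight) * confidence + weight * (1.0 - conflict_prob)


def selective_accuracy(scores: Sequence[float], correct: Sequence[bool], coverage: float) -> float:
    """Accuracy on the top `coverage` fraction of examples ranked by score."""
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and aligned")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, int(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Toy data (made up): the most confident answer is wrong and sits on conflicting evidence.
confidence = [0.9, 0.8, 0.6, 0.5]
conflict_prob = [0.9, 0.1, 0.2, 0.8]
correct = [False, True, True, False]

combined = [abstention_score(c, p) for c, p in zip(confidence, conflict_prob)]

conf_only_acc = selective_accuracy(confidence, correct, coverage=0.5)
combined_acc = selective_accuracy(combined, correct, coverage=0.5)
print(conf_only_acc, combined_acc)
```

In this toy case the confidence-only ranking keeps the confidently wrong answer, while the combined score pushes it below the abstention cutoff because the detector flags its evidence as conflicting.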

Problem

Research questions and friction points this paper is trying to address.

conflicting evidence

retrieval-augmented LLMs

biomedical question answering

evidence order effects

model reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation

evidence conflict

order effects