MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the safety risks posed by biomedical retrieval-augmented generation (RAG) systems, which often produce unsupported or even evidence-contradictory claims in long-form responses. To tackle this issue, the authors propose MedRAGChecker, the first framework enabling claim-level, fine-grained verification of RAG outputs. It decomposes generated answers into atomic claims and integrates evidence-driven natural language inference with consistency checks against biomedical knowledge graphs. Leveraging reliability-weighted ensemble scoring and lightweight model distillation, MedRAGChecker effectively identifies unsupported or contradictory claims across four biomedical question-answering benchmarks. The approach precisely distinguishes failure modes originating from retrieval versus generation components and reveals critical differences among models in their handling of safety-sensitive relational assertions.

📝 Abstract
Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics that help disentangle retrieval and generation failures, including faithfulness, under-evidence, contradiction, and safety-critical error rates. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.
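The abstract's verification pipeline (atomic claims, NLI plus KG signals, class-specific reliability weighting) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label set, verifier names, and reliability weights are assumed placeholders, and the real system would obtain per-claim votes from trained NLI and KG-consistency models rather than hard-coded dictionaries.

```python
# Sketch of claim-level verification with a reliability-weighted ensemble.
# All names and numbers below are illustrative assumptions.

LABELS = ["supported", "under_evidenced", "contradicted"]

# Class-specific reliability weight for each verifier (hypothetical values):
# e.g. the KG check is assumed more trustworthy for contradictions than
# for support decisions.
RELIABILITY = {
    "nli": {"supported": 0.8, "under_evidenced": 0.6, "contradicted": 0.9},
    "kg":  {"supported": 0.5, "under_evidenced": 0.4, "contradicted": 0.7},
}

def ensemble_label(votes: dict) -> str:
    """votes maps verifier name -> predicted label for one atomic claim.
    Each vote is weighted by that verifier's reliability for the label it
    predicted; the label with the highest total weight wins."""
    scores = {label: 0.0 for label in LABELS}
    for verifier, label in votes.items():
        scores[label] += RELIABILITY[verifier][label]
    return max(scores, key=scores.get)

def answer_diagnostics(claim_votes: list) -> dict:
    """Aggregate per-claim decisions into answer-level rates
    (faithfulness, under-evidence, contradiction)."""
    labels = [ensemble_label(v) for v in claim_votes]
    n = len(labels)
    return {label: labels.count(label) / n for label in LABELS}

# Three atomic claims from one generated answer:
claims = [
    {"nli": "supported", "kg": "supported"},
    {"nli": "contradicted", "kg": "supported"},   # NLI outweighs KG here
    {"nli": "under_evidenced", "kg": "under_evidenced"},
]
print(answer_diagnostics(claims))
```

In the second claim, the NLI verifier's contradiction weight (0.9) beats the KG verifier's support weight (0.5), so the claim is flagged as contradicted; the answer-level rates then summarize how many claims fall into each category.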
Problem

Research questions and friction points this paper is trying to address.

biomedical retrieval-augmented generation
claim-level verification
unsupported claims
contradictory claims
safety-critical errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

claim-level verification
biomedical RAG
knowledge graph consistency
natural language inference
safety-critical error detection
Authors

Yuelyu Ji, University of Pittsburgh (Natural language processing; Health information detection; Large language model evaluation)
Min Gu Kwak, University of Pittsburgh, Pittsburgh, PA, USA
Hang Zhang, University of Pittsburgh (Representation Learning; AI in Healthcare)
Xizhi Wu, University of Pittsburgh, Pittsburgh, PA, USA
Chenyu Li, University of Pittsburgh, Pittsburgh, PA, USA
Yanshan Wang, University of Pittsburgh, Pittsburgh, PA, USA