Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

📅 2026-04-10

🤖 AI Summary

This work addresses persistent hallucination in retrieval-augmented generation (RAG) systems, which often occurs even when relevant documents are available, and highlights how conventional answer-level and passage-level evaluation fails to expose fine-grained problems in evidence utilization. The authors propose a diagnostic framework for question answering that decomposes each question into atomic reasoning facets and constructs a Facet × Chunk matrix. By combining retrieval relevance with natural language inference (NLI)-based faithfulness scores, the framework analyzes evidence usage across three inference modes: Strict RAG, Soft RAG, and LLM-only. This enables facet-level diagnosis of RAG behavior, uncovering systematic failure modes (missing evidence, misaligned evidence, and prior-driven overrides) that remain invisible at the answer level. Experimental results indicate that hallucinations stem less from retrieval inaccuracy than from flawed evidence integration during generation, offering an interpretable foundation for improving RAG systems.

📝 Abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostic framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet × Chunk matrix that combines retrieval relevance with natural language inference (NLI)-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three LLMs spanning closed-source and open-source families (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
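
To make the Facet × Chunk idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each matrix cell holds a retrieval relevance score and an NLI entailment probability, both in [0, 1]. The names `Cell`, `diagnose_facet`, and `diagnose`, the 0.5 thresholds, and the rule that separates a prior-driven override from plain misalignment (the facet-level claim matches the LLM-only answer) are all illustrative assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """One (facet, chunk) entry of a hypothetical Facet x Chunk matrix."""
    relevance: float   # assumed retrieval relevance score in [0, 1]
    entailment: float  # assumed NLI entailment probability in [0, 1]


def diagnose_facet(row: List[Cell], matches_llm_only: bool,
                   rel_thr: float = 0.5, ent_thr: float = 0.5) -> str:
    """Label one facet from its row of cells.

    The thresholds and the decision rule are assumptions for illustration.
    """
    # Keep only chunks that the retriever scored as relevant to this facet.
    relevant = [c for c in row if c.relevance >= rel_thr]
    if not relevant:
        return "evidence_absence"
    # The facet is grounded if any relevant chunk entails the generated claim.
    if any(c.entailment >= ent_thr for c in relevant):
        return "grounded"
    # Relevant evidence exists but does not support the claim. If the claim
    # agrees with the LLM-only answer, treat it as a prior-driven override.
    if matches_llm_only:
        return "prior_driven_override"
    return "evidence_misalignment"


def diagnose(matrix: List[List[Cell]], matches_llm_only: List[bool]) -> List[str]:
    """Apply diagnose_facet to every facet (row) of the matrix."""
    return [diagnose_facet(row, flag)
            for row, flag in zip(matrix, matches_llm_only)]


# Toy matrix: four facets (rows) scored against two retrieved chunks (columns).
example_matrix = [
    [Cell(0.9, 0.8), Cell(0.2, 0.1)],  # relevant chunk entails the claim
    [Cell(0.3, 0.9), Cell(0.1, 0.0)],  # no chunk passes the relevance threshold
    [Cell(0.8, 0.2), Cell(0.7, 0.1)],  # relevant chunks, claim not entailed
    [Cell(0.8, 0.2), Cell(0.6, 0.3)],  # same, but claim matches LLM-only output
]
example_flags = [False, False, False, True]
labels = diagnose(example_matrix, example_flags)
print(labels)
```

In this sketch the second facet is labelled as evidence absence even though one chunk has a high entailment score, because entailment is only consulted for chunks that pass the relevance threshold. That ordering is a design choice of the sketch; a real system could combine the two scores differently.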

Problem

Research questions and friction points this paper is trying to address.

hallucination

retrieval-augmented generation

evidence grounding

facet-level analysis

retrieval-generation misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Facet-level Analysis

Evidence Grounding

Retrieval-Augmented Generation