Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Hallucination—where RAG-generated outputs contradict or exceed retrieved evidence—remains pervasive, and existing detection methods rely heavily on costly annotated data or external LLM-based evaluation, suffering from poor interpretability and scalability. Method: We introduce RAGLens, the first hallucination detector for RAG that leverages sparse autoencoders (SAEs) from mechanistic interpretability. RAGLens operates internally by analyzing layer-wise LLM activations, integrating information-theoretic feature selection with additive modeling to isolate hallucination-specific neural activation patterns. Contribution/Results: Evaluated across multiple benchmarks, RAGLens significantly outperforms state-of-the-art detectors in accuracy while remaining lightweight and fully interpretable. It reveals systematic cross-layer distributions of hallucination signals and identifies recurrent neuron groups associated with factual inconsistency, enabling both post-hoc intervention and mechanistic analysis of hallucination generation.

📝 Abstract
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
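The pipeline the abstract describes (SAE feature extraction, information-based feature selection, additive modeling) can be sketched as below. This is a minimal illustration, not the paper's implementation: the SAE weights, layer choice, feature counts, data, and the logistic stand-in for the additive model are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions; the paper's actual model sizes are not given here.
d_model, d_sae, n_examples = 64, 256, 200

# --- 1. SAE encode step (a pretrained encoder is assumed) ---
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(h):
    """Map a dense LLM activation vector to sparse SAE features (ReLU)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

# Toy layer activations and hallucination labels (random placeholders).
H = rng.normal(size=(n_examples, d_model))
y = rng.integers(0, 2, n_examples)
F = np.array([sae_encode(h) for h in H])  # (n_examples, d_sae)

# --- 2. Information-based feature selection: mutual information between
# binarized feature activity and the hallucination label ---
def mutual_information(a, b):
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

active = (F > 0).astype(int)
mi_scores = np.array([mutual_information(active[:, j], y) for j in range(d_sae)])
top_k = np.argsort(mi_scores)[-16:]  # keep the 16 most informative features

# --- 3. Additive model over selected features: the detection score is a sum
# of per-feature contributions (a simple logistic fit as a stand-in) ---
X = F[:, top_k]
w, b = np.zeros(len(top_k)), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * X.T @ g / n_examples
    b -= 0.1 * g.mean()

def flag_hallucination(h, threshold=0.5):
    """Score a single activation vector and flag it as unfaithful if above threshold."""
    score = sae_encode(h)[top_k] @ w + b
    return 1 / (1 + np.exp(-score)) > threshold
```

Because each selected feature contributes an additive term to the score, a flagged output can be traced back to the individual SAE features that drove the decision, which is the interpretability property the abstract emphasizes.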
Problem

Research questions and friction points this paper is trying to address.

Detects unfaithful outputs in retrieval-augmented generation systems
Reduces reliance on costly training data or external LLM judges
Uses sparse autoencoders to interpret internal model activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoders to identify hallucination features
Implements lightweight detector with internal LLM representations
Provides interpretable rationales for post-hoc mitigation
Guangzhi Xiong
University of Virginia
Zhenghao He
Department of Computer Science, University of Virginia
Bohan Liu
Department of Computer Science, University of Virginia
Sanchit Sinha
University of Virginia
Natural Language Processing · Machine Learning · Computer Vision
Aidong Zhang
Department of Computer Science, University of Virginia