ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address hallucinations in Retrieval-Augmented Generation (RAG) caused by the entanglement of parametric and externally retrieved knowledge, this paper performs a mechanistic interpretability analysis of the residual stream, revealing that hallucinations arise when feed-forward networks (FFNs) over-rely on internal parametric knowledge while copying heads fail to effectively integrate external content. Building on this insight, the authors propose ReDeEP, a lightweight, interpretable detector that decouples the model's utilization of external and parametric knowledge via targeted residual stream analysis, together with AARF, a hallucination mitigation mechanism that reweights the contributions of Knowledge FFNs and Copying Heads. Evaluated across multiple benchmarks, the detector achieves an average 12.7% improvement in hallucination detection accuracy, while AARF reduces hallucination rates by up to 38.5%. Crucially, both components require no fine-tuning or additional training.
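The mitigation idea described above can be sketched in a few lines. This is an illustrative toy only: the function name, scaling factors, and list-based "residual stream" are hypothetical stand-ins, not the paper's actual implementation, which operates on transformer activations.

```python
# Hedged sketch of AARF-style mitigation (all names and factors hypothetical).
# When a token is flagged as likely hallucinated, scale up the attention
# (copying-head) update and scale down the knowledge-FFN update before both
# are added back into the residual stream.

def aarf_style_update(residual, attn_out, ffn_out,
                      hallucinating, up=1.5, down=0.5):
    """Combine sub-layer outputs into the residual stream, reweighting
    them when the hallucination signal fires."""
    if hallucinating:
        attn_out = [up * a for a in attn_out]   # boost external-copying path
        ffn_out = [down * f for f in ffn_out]   # damp parametric-knowledge path
    return [r + a + f for r, a, f in zip(residual, attn_out, ffn_out)]

# Toy usage on a 2-dimensional "residual stream":
out = aarf_style_update([0.0, 0.0], [1.0, 2.0], [2.0, 4.0], hallucinating=True)
# → [2.5, 5.0]
```

The design point is that AARF intervenes only at inference time, by rescaling existing sub-layer outputs, which is why no fine-tuning or additional training is required.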

๐Ÿ“ Abstract
Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) utilize external and parametric knowledge. Current detection methods often focus on only one of these mechanisms, or fail to decouple their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover that hallucinations occur when the Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual stream, while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we propose ReDeEP, a novel method that detects hallucinations by decoupling the LLM's utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.
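The decoupling idea in the abstract can be sketched as a simple per-token score. Everything here is a hypothetical illustration, not the paper's scoring function: assume we already have, per generated token, a measure of how much attention the copying heads place on retrieved-context positions (external grounding) and how strongly the knowledge FFNs write into the residual stream (parametric reliance).

```python
# Hedged sketch of a ReDeEP-style hallucination score (illustrative only).
# Intuition from the abstract: hallucination is signaled by high parametric
# (FFN) contribution combined with low external (copying-head) contribution.

def redeep_style_score(external_scores, parametric_scores,
                       alpha=1.0, beta=1.0):
    """Higher score -> more likely hallucinated.

    external_scores: per-token copying-head attention mass on retrieved
        context (higher = more grounded in the external evidence).
    parametric_scores: per-token knowledge-FFN contribution to the
        residual stream (higher = more reliance on internal knowledge).
    alpha, beta: hypothetical weights balancing the two signals.
    """
    assert len(external_scores) == len(parametric_scores)
    token_scores = [beta * p - alpha * e
                    for e, p in zip(external_scores, parametric_scores)]
    # Aggregate over the generated answer; the mean is one simple choice.
    return sum(token_scores) / len(token_scores)

# Toy usage: one grounded token and one parametric-heavy token cancel out.
score = redeep_style_score([0.8, 0.2], [0.1, 0.9])
```

A threshold on this score would then turn it into a binary hallucination detector; because the score is read off existing activations, no training is needed.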
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Factuality Issues
Retrieval-Augmented Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReDeEP
AARF
RAG model accuracy enhancement
Zhongxiang Sun
Renmin University of China
Search · Recommendation · LLM · Legal
Xiaoxue Zang
Kuaishou Technology
Recommender System · NLP · Dialogue · Multimodal Modeling
Kai Zheng
Kuaishou Technology Co., Ltd., Beijing, China
Jun Xu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Xiao Zhang
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Weijie Yu
School of Information Technology and Management, University of International Business and Economics
Yang Song
Kuaishou Technology Co., Ltd., Beijing, China
Han Li
Kuaishou Technology Co., Ltd., Beijing, China