SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the pervasive hallucination problem of large language models (LLMs) in safety-critical applications, this paper proposes SAFE, a hallucination-aware query enrichment and mitigation framework based on sparse autoencoders (SAEs). Unlike conventional post-hoc hallucination detection paradigms, SAFE employs SAEs at the query level, enabling hallucination-aware semantic calibration and representation-level intervention before generation. Its core contribution is an end-to-end hallucination-aware query rewriting mechanism that supports proactive hallucination suppression. Experiments on two models with available SAEs, across three diverse cross-domain datasets, show that SAFE consistently improves query generation accuracy (by up to 29.45%), reduces hallucination rates, and generalizes better than the baseline methods evaluated.
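For readers unfamiliar with the building block, the sketch below shows the standard sparse autoencoder formulation commonly used for LLM interpretability: a linear encoder with a ReLU nonlinearity producing an overcomplete sparse code, a linear decoder reconstructing the input activation, and an L1 penalty encouraging sparsity. This is a generic, minimal illustration rather than SAFE's actual implementation; the class name, dimensions, and sparsity coefficient are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic SAE of the kind used for LLM interpretability.

    Illustrative only -- not the paper's implementation; the
    dimensions and sparsity coefficient below are assumptions.
    """

    def __init__(self, d_model: int = 768, d_hidden: int = 8192,
                 l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete sparse code
        self.decoder = nn.Linear(d_hidden, d_model)  # reconstruct the activation
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))              # sparse feature activations
        x_hat = self.decoder(z)                      # reconstruction
        return x_hat, z

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        x_hat, z = self(x)
        recon = ((x_hat - x) ** 2).mean()            # reconstruction error
        sparsity = z.abs().mean()                    # L1 sparsity penalty
        return recon + self.l1_coeff * sparsity
```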

📝 Abstract
Despite the state-of-the-art performance of Large Language Models (LLMs), these models often suffer from hallucinations, which can undermine their performance in critical applications. In this work, we propose SAFE, a novel method for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across three diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
Problem

Research questions and friction points this paper is trying to address.

LLMs often hallucinate, undermining their reliability in safety-critical applications.
Hallucination detection techniques and SAEs have been explored independently, but their synergistic use, particularly for hallucination-aware query enrichment, remains underexplored.
Post-hoc detection alone does not proactively suppress hallucinated generations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Sparse Autoencoders for hallucination detection at the query level
Integrates SAE-derived signals into hallucination-aware query enrichment (a rough pipeline is sketched below)
Improves query generation accuracy by up to 29.45% across three cross-domain datasets
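As a rough illustration of how such a pipeline could be wired together, the sketch below scores a query's SAE feature activations with a hallucination-risk probe and enriches the query before generation when the risk is high. None of this is the paper's published code: the helper names, the linear probe, and the enrichment prompt are assumptions made for illustration, and the SAE is assumed to follow the interface from the sketch above.

```python
import torch

def hallucination_risk(z: torch.Tensor, probe_w: torch.Tensor) -> float:
    # Toy risk score: a linear probe over SAE feature activations.
    # The probe weights would be learned; here they are an assumption.
    return torch.sigmoid(z @ probe_w).item()

def enrich_query(query: str) -> str:
    # Placeholder enrichment; the paper's actual rewriting
    # mechanism is not specified in this summary.
    return "Using only verifiable facts, answer: " + query

def safe_generate(llm_embed, llm_generate, sae, probe_w,
                  query: str, threshold: float = 0.5) -> str:
    """Hypothetical SAFE-style loop (names are illustrative):
    detect hallucination risk from SAE features of the query
    representation, then rewrite the query before generation."""
    h = llm_embed(query)                     # query-level hidden representation
    _, z = sae(h)                            # sparse feature activations
    if hallucination_risk(z, probe_w) > threshold:
        query = enrich_query(query)          # hallucination-aware rewriting
    return llm_generate(query)
```

The key design point this sketch tries to convey is that intervention happens on the input side, before decoding, rather than filtering outputs after the fact.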