Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low sentence-level accuracy, strong domain dependency, and high computational overhead of self-consistency methods for hallucination detection in large language models (LLMs), this paper proposes an efficient decoding-memory framework. The authors first identify substantial prefix redundancy in self-consistent multi-path generation, showing that non-answer tokens contribute little to semantic discrimination. Leveraging this insight, they design a model- and decoding-agnostic decoding-memory pipeline that integrates shared-prefix identification, selective inference, and annealed decoding. The method achieves up to a 3× generation speedup without degrading AUROC and generalizes across diverse tasks. The core innovation lies in converting redundant computation into an acceleration resource, establishing a lightweight, general-purpose paradigm for hallucination detection.
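The shared-prefix idea can be illustrated with a minimal sketch (not the authors' code): given several sampled generations for the same prompt, find the longest common run of leading tokens, which could then be computed once and reused across paths. The function name and token IDs below are purely hypothetical.

```python
def shared_prefix(token_sequences):
    """Return the longest common prefix of several token-ID sequences."""
    if not token_sequences:
        return []
    prefix = []
    for position in zip(*token_sequences):
        if all(t == position[0] for t in position):
            prefix.append(position[0])
        else:
            break
    return prefix

# Three hypothetical sampled generations that diverge at the fourth token.
samples = [
    [101, 7592, 2088, 2003, 1037],
    [101, 7592, 2088, 2003, 2204],
    [101, 7592, 2088, 2024, 1037],
]
print(shared_prefix(samples))  # → [101, 7592, 2088]
```

In a real pipeline, the KV cache for this shared prefix would be computed once and shared, which is where the speedup comes from.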

📝 Abstract
Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
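The self-consistency principle the abstract builds on can be sketched as follows: sample several answers to the same question and score how often they agree; low agreement suggests hallucination. This is an illustrative stand-in only — real detectors use semantic similarity or NLI rather than the exact string match used here, and the function name is hypothetical.

```python
def consistency_score(reference, samples):
    """Fraction of sampled answers that agree with the reference answer.

    Exact (case-insensitive) string match stands in for the semantic
    comparison a real self-consistency detector would use.
    """
    if not samples:
        return 0.0
    matches = sum(
        1 for s in samples
        if s.strip().lower() == reference.strip().lower()
    )
    return matches / len(samples)

# Low agreement across samples is taken as a hallucination signal.
print(round(consistency_score("Paris", ["Paris", "paris", "Lyon"]), 2))  # → 0.67
```

The repeated sampling behind this score is exactly the cost the DMP attacks: most of each sample's tokens are redundant prefix, so only the divergent, answer-bearing suffixes need full inference.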
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in large language model outputs
Reducing computational costs of self-consistency methods
Accelerating multi-response generation while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective inference for efficient generation
Annealed decoding to reduce redundancy
Shared prefix token identification for acceleration
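One plausible reading of "annealed decoding" is a temperature schedule over token positions; the summary does not specify the schedule, so the low-to-high direction, function names, and default values below are assumptions for illustration only.

```python
import math
import random

def annealed_temperature(step, t_start=0.2, t_end=1.0, n_steps=32):
    """Hypothetical linear schedule: low temperature for early tokens
    (encouraging shared prefixes across samples), rising toward t_end
    for later, answer-bearing tokens where diversity matters."""
    frac = min(step / max(n_steps - 1, 1), 1.0)
    return t_start + (t_end - t_start) * frac

def sample_token(logits, temperature):
    """Softmax sampling at the given temperature (numerically stabilized)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]
```

Low early temperature makes independent samples more likely to emit identical prefix tokens, enlarging the shared prefix that the pipeline can deduplicate.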