🤖 AI Summary
To address the low sentence-level accuracy and strong domain dependence of existing hallucination detection methods for large language models (LLMs), along with the high computational overhead of self-consistency approaches, this paper proposes an efficient decoding-memory framework. We first identify substantial prefix redundancy in multi-path generation under self-consistency, revealing that non-answer tokens contribute little to semantic discrimination. Leveraging this insight, we design a model- and decoding-agnostic decoding-memory pipeline that integrates shared-prefix identification, selective inference, and annealed decoding. Our method achieves up to a 3× generation speedup without degrading AUROC and generalizes across diverse tasks. The core innovation lies in converting redundant computation into an acceleration resource, establishing a new paradigm for lightweight, general-purpose hallucination detection.
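The shared-prefix observation above can be made concrete with a toy sketch: when several sampled responses to the same prompt begin with identical tokens, the forward passes for that common prefix only need to be computed once and can be reused across paths. The function name and the token lists below are illustrative assumptions, not the paper's actual implementation.

```python
def shared_prefix_length(generations):
    """Length of the longest token prefix common to all generations."""
    if not generations:
        return 0
    n = min(len(g) for g in generations)
    for i in range(n):
        first = generations[0][i]
        if any(g[i] != first for g in generations):
            return i
    return n

# Toy example: three sampled answers that diverge only at the answer token.
gens = [
    ["The", "capital", "of", "France", "is", "Paris", "."],
    ["The", "capital", "of", "France", "is", "Lyon", "."],
    ["The", "capital", "of", "France", "is", "Paris", ",", "I", "think"],
]

k = shared_prefix_length(gens)
print(k)                      # 5 tokens are shared across all paths
saved = k * (len(gens) - 1)   # forward passes avoidable by caching the prefix
print(saved)
```

Under this reading, the savings grow with both the number of sampled paths and the length of the shared, non-answer prefix, which is why redundancy becomes an acceleration resource.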
📝 Abstract
Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study of redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Because it is orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3× speedup without sacrificing AUROC performance.
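The abstract does not define "annealed decoding," but one plausible reading is a sampling schedule whose temperature decays over decoding steps, so early tokens stay diverse while later (answer-adjacent) tokens become near-greedy. The sketch below assumes that interpretation; the function name, schedule, and parameters are illustrative and may differ from the paper's method.

```python
import math
import random

def annealed_sample(logits, step, total_steps, t_start=1.0, t_end=0.1):
    """Sample a token index with temperature linearly annealed from
    t_start (early, diverse) to t_end (late, near-greedy)."""
    t = t_start + (t_end - t_start) * step / max(total_steps - 1, 1)
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    m = max(x / t for x in logits)
    exps = [math.exp(x / t - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

random.seed(0)
# At the final step the temperature is t_end = 0.1, so the distribution
# sharply concentrates on the highest logit (index 1 here).
print(annealed_sample([0.0, 5.0, 1.0], step=9, total_steps=10))  # 1
```

The intuition matching the paper's observation: if non-answer tokens carry little semantic discrimination, sampling them cheaply and sharply late in the sequence reduces wasted diversity without hurting detection quality.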