Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models are often kept out of high-stakes deployments by factual hallucinations. This work proposes a lightweight, interpretable hallucination detection method grounded in predictive coding and the information bottleneck principle. By integrating three mechanisms -- Entity-Focused Uptake, Context Adherence, and Falsifiability Score -- the approach builds a supervised model with fewer than 1 million parameters that requires no external retrieval or large black-box components. Evaluated on HaluBench, the method achieves an AUROC of 0.8669 while using only 1/75th the training data of competing approaches and offering a 1000× inference speedup (5 ms vs. 5 s). It demonstrates exceptional data efficiency, full interpretability, and practical deployability, while also revealing the limitations of rationalization signals for hallucination identification.

📝 Abstract
Hallucinations in Large Language Models (LLMs) -- generations that are plausible but factually unfaithful -- remain a critical barrier to high-stakes deployment. Current detection methods typically rely on computationally expensive external retrieval loops or opaque black-box LLM judges requiring 70B+ parameters. In this work, we introduce [Model Name], a hybrid detection framework that combines neuroscience-inspired signal design with supervised machine learning. We extract interpretable signals grounded in Predictive Coding (quantifying surprise against internal priors) and the Information Bottleneck (measuring signal retention under perturbation). Through systematic ablation, we demonstrate three key enhancements: Entity-Focused Uptake (concentrating on high-value tokens), Context Adherence (measuring grounding strength), and Falsifiability Score (detecting confident but contradictory claims). Evaluated on HaluBench (n=200, perfectly balanced), our theory-guided baseline achieves 0.8017 AUROC. BASE supervised models reach 0.8274 AUROC, while IMPROVED features boost performance to 0.8669 AUROC (a 4.95% gain), demonstrating consistent improvements across architectures. This competitive performance is achieved while using 75× less training data than Lynx (200 vs. 15,000 samples), running inference 1000× faster (5 ms vs. 5 s), and remaining fully interpretable. Crucially, we report a negative result: the Rationalization signal fails to distinguish hallucinations, suggesting that LLMs generate coherent reasoning for false premises ("Sycophancy"). This work demonstrates that domain knowledge encoded in signal architecture provides superior data efficiency compared to scaling LLM judges, achieving strong performance with lightweight (less than 1M parameter), explainable models suitable for production deployment.
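The two theoretical signals named in the abstract can be sketched in a few lines. Below is a minimal, hypothetical illustration (not the paper's actual implementation): the Predictive Coding cue is approximated as mean token surprise (negative log-probability under the model's own prior), and the Information Bottleneck cue as how much of that signal is retained when the context is perturbed. The normalization in `retention_score` is an assumption made for illustration.

```python
import math

def surprise_signal(token_probs):
    """Mean negative log-probability ("surprise") over generated tokens.

    A Predictive Coding-style cue: tokens the model itself finds
    improbable carry high surprise. `token_probs` are hypothetical
    per-token probabilities from any LLM.
    """
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

def retention_score(signal_original, signal_perturbed):
    """Information Bottleneck-style retention of a signal under a
    context perturbation (1.0 = fully retained, 0.0 = fully lost).

    The relative-difference normalization here is an illustrative
    assumption, not the paper's exact formula.
    """
    denom = max(abs(signal_original), 1e-9)
    return 1.0 - abs(signal_original - signal_perturbed) / denom
```

On this sketch, a well-grounded answer (high per-token probabilities) yields low surprise, while a hallucinated span (low probabilities) yields high surprise; a supervised detector would consume such scores as features rather than a raw LLM judge's verdict.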
Problem

Research questions and friction points this paper is trying to address.

Hallucination Detection
Large Language Models
Predictive Coding
Information Bottleneck
Model Interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predictive Coding
Information Bottleneck
Hallucination Detection
Interpretable Signals
Data-Efficient Learning