🤖 AI Summary
This work addresses the challenge of detecting and mitigating contextual hallucinations — generations unsupported by the input context — in large language models (LLMs). We propose a generator-agnostic, general-purpose observer framework. Methodologically, we construct linear probes over residual streams and discover, for the first time, a low-dimensional, cross-model transferable linear direction that enables high-accuracy hallucination detection with only a single forward pass. By integrating gradient×activation localization with sparse late-layer MLP analysis, we establish a causal link between this direction and specific MLP sub-circuits, enabling active regulation of hallucination rates. Our approach achieves robust detection across the Gemma-2 family regardless of parameter scale, outperforming baselines by 5–27 percentage points and generalizing well at mid layers. To standardize evaluation, we release ContraTales, a benchmark of 2,000 carefully curated samples for assessing hallucination mitigation.
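The core detection recipe — a linear probe on residual-stream activations whose weight vector is the candidate "hallucination direction" — can be sketched as below. This is a minimal illustration, not the paper's implementation: the activations are synthetic stand-ins (real ones would come from a mid-layer of an observer model such as Gemma-2), with "hallucinated" examples shifted along a planted direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for residual-stream activations: hallucinated examples
# are shifted along a hidden "true" direction, faithful ones are not.
rng = np.random.default_rng(0)
d_model = 256
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

n = 500
faithful = rng.normal(size=(n, d_model))
hallucinated = rng.normal(size=(n, d_model)) + 3.0 * true_direction

X = np.vstack([faithful, hallucinated])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A linear probe is logistic regression on the activations; its (normalized)
# weight vector is the candidate low-dimensional hallucination direction.
probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# The learned direction should align with the planted one, and classification
# needs only this single linear readout — i.e., one forward pass at inference.
alignment = abs(direction @ true_direction)
accuracy = probe.score(X, y)
print(f"train accuracy: {accuracy:.2f}, alignment: {alignment:.2f}")
```

In this setup the probe recovers the planted direction almost exactly; on real activations the analogous finding is that one such direction transfers across model sizes.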
📝 Abstract
Contextual hallucinations -- statements unsupported by the given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5–27 percentage points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localizes this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, demonstrating that it is not merely correlational but actionable. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2,000-example ContraTales benchmark for realistic assessment of such solutions.
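Gradient-times-activation attribution, used above to localize the signal to sparse MLP activity, can be illustrated with a toy network. This is a hedged sketch with random weights standing in for a trained model: we score an input with a small ReLU MLP "probe head" and attribute the scalar score to individual hidden units via gradient × activation, which for this architecture sums exactly to the score.

```python
import numpy as np

# Toy two-layer network: hidden MLP activations followed by a linear readout.
# All weights are random stand-ins for a trained model's parameters.
rng = np.random.default_rng(1)
d_in, d_hidden = 64, 128
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=d_hidden) / np.sqrt(d_hidden)

x = rng.normal(size=d_in)
h = np.maximum(W1 @ x, 0.0)   # MLP hidden activations (ReLU)
score = w2 @ h                # scalar probe score

# d(score)/dh is w2 gated by the ReLU mask; gradient * activation gives a
# per-unit attribution that decomposes the score exactly (linear readout).
grad_h = w2 * (h > 0)
attribution = grad_h * h
assert np.isclose(attribution.sum(), score)

# Rank units by |attribution|: a small set typically dominates, which is the
# sense in which the signal is "localized" to sparse MLP sub-circuits.
top = np.argsort(-np.abs(attribution))[:5]
share = np.abs(attribution[top]).sum() / np.abs(attribution).sum()
print("top units:", top, f"share of total |attribution|: {share:.2f}")
```

In a real model the gradient would be computed by backpropagation through many layers, but the principle is the same: units with large gradient × activation are the candidates for causal intervention.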