🤖 AI Summary
This work addresses the challenge of detecting and mitigating contextual hallucinations — generations unsupported by the input context — in large language models (LLMs). We propose a generator-agnostic, general-purpose observer framework. Methodologically, we construct linear probes over residual streams and discover, for the first time, a low-dimensional, cross-model transferable linear direction that enables high-accuracy hallucination detection with only a single forward pass. By integrating gradient×activation localization with sparse late-layer MLP analysis, we establish a causal link between this direction and specific MLP sub-circuits, enabling active regulation of hallucination rates. Our approach achieves robust detection across the Gemma-2 family regardless of parameter scale, outperforming baselines by 5–27 percentage points and generalizing well at mid layers. To standardize evaluation, we release ContraTales, a benchmark of 2,000 carefully curated samples for assessing hallucination mitigation.
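The core detection recipe — a linear probe on residual-stream activations whose weight vector is the candidate "hallucination direction" — can be sketched as below. This is a minimal illustration, not the paper's implementation: the activations are synthetic stand-ins (real ones would come from a mid-layer of an observer model such as Gemma-2), with "hallucinated" examples shifted along a planted direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for residual-stream activations: hallucinated examples
# are shifted along a hidden "true" direction, faithful ones are not.
rng = np.random.default_rng(0)
d_model = 256
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

n = 500
faithful = rng.normal(size=(n, d_model))
hallucinated = rng.normal(size=(n, d_model)) + 3.0 * true_direction

X = np.vstack([faithful, hallucinated])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A linear probe is logistic regression on the activations; its (normalized)
# weight vector is the candidate low-dimensional hallucination direction.
probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# The learned direction should align with the planted one, and classification
# needs only this single linear readout — i.e., one forward pass at inference.
alignment = abs(direction @ true_direction)
accuracy = probe.score(X, y)
print(f"train accuracy: {accuracy:.2f}, alignment: {alignment:.2f}")
```

In this setup the probe recovers the planted direction almost exactly; on real activations the analogous finding is that one such direction transfers across model sizes.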
📝 Abstract
Contextual hallucinations -- statements unsupported by the given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5–27 percentage points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localizes this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, demonstrating that it is not merely correlational but actionable. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2,000-example ContraTales benchmark for realistic assessment of such solutions.
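Gradient-times-activation attribution, used above to localize the signal to sparse MLP activity, can be illustrated with a toy network. This is a hedged sketch with random weights standing in for a trained model: we score an input with a small ReLU MLP "probe head" and attribute the scalar score to individual hidden units via gradient × activation, which for this architecture sums exactly to the score.

```python
import numpy as np

# Toy two-layer network: hidden MLP activations followed by a linear readout.
# All weights are random stand-ins for a trained model's parameters.
rng = np.random.default_rng(1)
d_in, d_hidden = 64, 128
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=d_hidden) / np.sqrt(d_hidden)

x = rng.normal(size=d_in)
h = np.maximum(W1 @ x, 0.0)   # MLP hidden activations (ReLU)
score = w2 @ h                # scalar probe score

# d(score)/dh is w2 gated by the ReLU mask; gradient * activation gives a
# per-unit attribution that decomposes the score exactly (linear readout).
grad_h = w2 * (h > 0)
attribution = grad_h * h
assert np.isclose(attribution.sum(), score)

# Rank units by |attribution|: a small set typically dominates, which is the
# sense in which the signal is "localized" to sparse MLP sub-circuits.
top = np.argsort(-np.abs(attribution))[:5]
share = np.abs(attribution[top]).sum() / np.abs(attribution).sum()
print("top units:", top, f"share of total |attribution|: {share:.2f}")
```

In a real model the gradient would be computed by backpropagation through many layers, but the principle is the same: units with large gradient × activation are the candidates for causal intervention.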