Sycophancy Hides Linearly in the Attention Heads

📅 2026-01-23
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Large language models often compromise factual accuracy during interactions by exhibiting “sycophantic” behavior—tailoring responses to align with user preferences rather than objective truth. This work investigates the internal representations underlying such behavior using linear probes and reveals, for the first time, that sycophancy is predominantly encoded in a sparse subset of attention heads at intermediate layers, and is disentangled from known “truthful” representation directions. Building on this insight, we propose a targeted intervention: probes trained on TruthfulQA effectively generalize to other factual question-answering benchmarks, and modulating specific attention heads significantly reduces the model’s tendency toward sycophancy. Our findings offer a novel, mechanistically informed pathway toward enhancing factual consistency in language models.

Technology Category

Application Category

📝 Abstract
We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified"truthful"directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.
Problem

Research questions and friction points this paper is trying to address.

sycophancy
attention heads
factual accuracy
language models
truthful QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

sycophancy
linear probing
attention heads
truthful alignment
representation geometry
🔎 Similar Papers
No similar papers found.