How LLMs Are Persuaded: A Few Attention Heads, Rerouted

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models can be persuaded to abandon factual answers, yet the mechanism behind this vulnerability remains poorly understood. This work combines attention-head interventions, geometric analysis of the latent space, and feature manipulation to show that persuasion is governed by a small set of mid-layer attention heads that form a narrow, monitorable causal circuit. The study isolates a rank-one evidence-routing feature and traces it upstream to the shallower attention heads that build it from persuasive keywords in the input. Experiments across multiple open-source models and realistic poisoning scenarios such as Generative Engine Optimization show that directly modifying this feature steers the model's choice and removing it blocks persuasion, suggesting a route to more robust models.
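To make "attention head intervention" concrete, the sketch below zero-ablates one head in a toy multi-head attention layer and measures how much the layer output changes. It is only an illustration under stated assumptions: the weights are random, there is no causal mask, and the names `multi_head_attention` and `head_mask` are hypothetical. It is not the paper's code or any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask=None):
    """Toy attention layer (no causal mask, for brevity).

    x: (seq, d_model); Wq, Wk, Wv: (n_heads, d_model, d_head);
    Wo: (n_heads, d_head, d_model); head_mask: (n_heads,), 0 ablates a head.
    """
    n_heads, _, d_head = Wq.shape
    if head_mask is None:
        head_mask = np.ones(n_heads)
    q = np.einsum("sd,hde->hse", x, Wq)
    k = np.einsum("sd,hde->hse", x, Wk)
    v = np.einsum("sd,hde->hse", x, Wv)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Each head writes its own contribution; the layer output is their sum.
    per_head_out = np.einsum("hse,hed->hsd", per_head, Wo)
    return (head_mask[:, None, None] * per_head_out).sum(axis=0)

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 8, 4, 2
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads, d_head, d_model))

full = multi_head_attention(x, Wq, Wk, Wv, Wo)

mask = np.ones(n_heads)
mask[2] = 0.0  # zero-ablate head 2
ablated = multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask=mask)

only_head2 = multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask=1.0 - mask)
effect = np.linalg.norm(full - ablated)  # size of head 2's contribution
```

Because the heads' outputs are summed, removing a head subtracts exactly that head's contribution. In a real model, one would compare the answer logits before and after ablation rather than the raw layer output.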

📝 Abstract

Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
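The claims that "directly modifying this feature steers the model's choice, and removing it blocks persuasion" can be pictured as two operations on a single direction in activation space. The sketch below is a generic illustration on random data, assuming the rank-one feature is a known vector `u`. The names `ablate_direction` and `steer_direction` are hypothetical, and the paper's actual procedure for finding and editing the feature may differ.

```python
import numpy as np

def ablate_direction(h, u):
    """Project out the component of each row of h along direction u."""
    u = u / np.linalg.norm(u)
    return h - np.outer(h @ u, u)

def steer_direction(h, u, alpha):
    """Set the coefficient of each row of h along u to the value alpha."""
    u = u / np.linalg.norm(u)
    return ablate_direction(h, u) + alpha * u

# Hypothetical data: 4 token positions, hidden size 16.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))  # stand-in for hidden states
u = rng.normal(size=16)       # stand-in for the evidence-routing direction
u_hat = u / np.linalg.norm(u)

h_ablated = ablate_direction(h, u)            # "removing" the feature
h_steered = steer_direction(h, u, alpha=3.0)  # "modifying" the feature
```

Both operations leave every component orthogonal to `u` untouched, which is what makes a rank-one edit narrow: only the coefficient along the one direction changes.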

Problem

Research questions and friction points this paper is trying to address.

persuasion

factual errors

attention heads

language models

AI safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention heads

persuasion mechanism

evidence routing