How LLMs Are Persuaded: A Few Attention Heads, Rerouted

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large language models can be persuaded to deviate from factual responses, yet the underlying mechanism remains poorly understood. Using attention-head interventions, latent-space geometric analysis, and feature manipulation, this work shows that persuasion is governed by a small set of mid-layer attention heads that form a narrow, monitorable causal circuit. The study isolates a rank-one evidence-routing feature and traces it to its upstream source, a band of shallower attention heads that build it from persuasive keywords in the input. Experiments across multiple open-source models and realistic poisoning scenarios such as Generative Engine Optimization show that directly modifying this feature steers the model's choice and removing it blocks persuasion, which suggests a route toward more robust models.
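To make "modifying or removing this feature" concrete, here is a minimal sketch of what editing a rank-one feature can look like on a matrix of hidden states. It is not the paper's code: the direction `v`, the activations `h`, and the helper names `remove_feature` and `steer_feature` are all hypothetical, and real experiments would apply such an edit inside a model's forward pass.

```python
import numpy as np

def remove_feature(h, v):
    """Project the direction v out of every row of h (hypothetical helper)."""
    unit = v / np.linalg.norm(v)
    return h - np.outer(h @ unit, unit)

def steer_feature(h, v, alpha):
    """Set the component of every row of h along v to alpha (hypothetical helper)."""
    unit = v / np.linalg.norm(v)
    return remove_feature(h, v) + alpha * unit

# Assumed toy data: 4 token positions, hidden size 8, random feature direction.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
v = rng.normal(size=8)

h_removed = remove_feature(h, v)            # the feature can no longer be read out
h_steered = steer_feature(h, v, alpha=3.0)  # the feature is pinned to a chosen value
```

Because the feature is rank one, both edits touch a single direction and leave every component orthogonal to `v` unchanged, which is what makes such a circuit comparatively easy to monitor.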
📝 Abstract
Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
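The claim that decision heads copy whichever option token their attention selects, so that persuasion acts by redirecting attention, can be illustrated with a toy single-head example. This is only a sketch under stated assumptions: the keys, the one-hot "vertex" value vectors, the `route` direction, and the `scale` factor are invented for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Assumed toy setup: three answer options A, B, C.
values = np.eye(3)                       # each option's value vector is a distinct vertex
keys = np.array([[1.0, 0.0],             # key for option A
                 [0.0, 1.0],             # key for option B
                 [-1.0, -1.0]])          # key for option C

def head_output(query, scale=8.0):
    """Attention-weighted copy of the option vertices."""
    weights = softmax(scale * (keys @ query))
    return weights @ values

q_clean = np.array([1.0, 0.0])           # clean query attends to option A
route = np.array([-1.0, 1.0])            # hypothetical routing direction toward option B

out_clean = head_output(q_clean)
out_pers = head_output(q_clean + 1.5 * route)

# Sweep the strength of the routing component and report the selected option.
for alpha in (0.0, 0.25, 0.75, 1.0):
    out = head_output(q_clean + alpha * route)
    print(alpha, "ABC"[int(np.argmax(out))], np.round(out, 3))
```

In this toy the head's output stays close to one vertex on either side of the switch: the selected option changes from A to B somewhere between strengths 0.25 and 0.75, rather than drifting gradually through a blurred mixture.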
Problem

Research questions and friction points this paper is trying to address.

persuasion
factual errors
attention heads
language models
AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention heads
persuasion mechanism
evidence routing
latent polyhedron
intervention analysis