Surgical Activation Steering via Generative Causal Mediation

📅 2026-02-17
🤖 AI Summary
This work addresses the challenge of controlling specific behaviors, such as refusal, sycophancy, or stylistic expression, that are spread across many tokens of a long-form language model response. The authors propose Generative Causal Mediation (GCM), a procedure that applies causal mediation analysis to long-form generation. From a dataset of contrastive inputs and responses, GCM quantifies how strongly individual model components (e.g., attention heads) mediate a binary target concept and selects the strongest mediators for sparse activation steering. Evaluated on refusal, sycophancy, and style transfer across three language models, GCM consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads.

📝 Abstract
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks (refusal, sycophancy, and style transfer) across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
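
The selection step described in the abstract (score how much each component mediates the concept, then keep the strongest mediators) can be illustrated with a toy sketch. This is not the paper's estimator: the abstract does not specify one, so the sketch assumes a standard activation-patching estimate of each head's indirect effect. The scalar "model" `toy_score`, the per-head scalar activations, and all function names are hypothetical stand-ins for a real LM and its attention-head outputs.

```python
import numpy as np

def toy_score(head_acts, weights):
    """Hypothetical stand-in for how strongly a model prefers the contrast
    response, given one scalar activation per head (shape: [n_heads])."""
    return float(np.tanh(head_acts @ weights))

def indirect_effects(base_acts, contrast_acts, weights):
    """Assumed estimator: patch one head at a time from the contrast run into
    the base run and average the resulting change in score over all pairs."""
    n_pairs, n_heads = base_acts.shape
    effects = np.zeros(n_heads)
    for base, contrast in zip(base_acts, contrast_acts):
        baseline = toy_score(base, weights)
        for h in range(n_heads):
            patched = base.copy()
            patched[h] = contrast[h]  # intervene on head h only
            effects[h] += toy_score(patched, weights) - baseline
    return effects / n_pairs

def select_top_k(effects, k):
    """Return the indices of the k heads with the largest estimated effect."""
    return [int(i) for i in np.argsort(-effects)[:k]]

# Synthetic contrastive pairs: heads 0, 2, and 5 all differ between the two
# runs, but only heads 2 and 5 influence the score (head 0 has zero weight).
rng = np.random.default_rng(0)
n_pairs, n_heads = 32, 8
weights = np.zeros(n_heads)
weights[2], weights[5] = 1.5, 0.8
base_acts = rng.normal(0.0, 0.1, size=(n_pairs, n_heads))
contrast_acts = base_acts.copy()
contrast_acts[:, [0, 2, 5]] += 1.0

effects = indirect_effects(base_acts, contrast_acts, weights)
top_heads = select_top_k(effects, k=2)
print(top_heads)
```

In this toy setup, head 0's activation differs between the base and contrast runs, so a purely correlational probe could flag it, yet patching it leaves the score unchanged and it is not selected. That is the kind of gap between correlational and causal selection the abstract alludes to, shown here only under the stated assumptions.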

Problem

Research questions and friction points this paper is trying to address.

language models
intervention
long-form responses
concept steering
causal mediation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Causal Mediation
attention heads
concept steering
long-form responses
causal mediation