Endogenous Resistance to Activation Steering in Language Models

📅 2026-02-06
🤖 AI Summary
This study identifies and names Endogenous Steering Resistance (ESR), a phenomenon in which large language models recover mid-generation and produce improved responses even while task-misaligned activation steering remains active. Using sparse autoencoder (SAE) latents to steer model activations, the authors find substantial ESR in Llama-3.3-70B and pinpoint 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR. Drawing on zero ablation, causal mediation analysis, meta-prompting, and self-correction fine-tuning, they show that zero-ablating these latents reduces the multi-attempt rate by 25%, which they present as causal evidence for dedicated internal consistency-checking circuits, whereas meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x. Fine-tuning on self-correction examples also induces ESR-like behavior in smaller models, showing that this intrinsic resistance can be deliberately enhanced.

📝 Abstract
Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
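The two interventions named in the abstract, steering with an SAE latent and zero-ablating a set of latents, can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the authors' implementation (their code is in the repository linked above). It assumes a plain ReLU sparse autoencoder, steering by adding a scaled decoder direction to the activation, and an ablation that preserves the SAE reconstruction error. All function names, weights, and values here are hypothetical.

```python
from typing import List, Set

Vector = List[float]
Matrix = List[List[float]]  # one row per SAE latent


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU encoder (assumed form): latent i = relu(w_enc[i] . x + b_enc[i])."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Linear decoder: w_dec[i] is the decoder direction of latent i."""
    out = list(b_dec)
    for zi, direction in zip(z, w_dec):
        for j, d in enumerate(direction):
            out[j] += zi * d
    return out


def steer(x: Vector, w_dec: Matrix, latent: int, strength: float) -> Vector:
    """Push the activation along one latent's decoder direction."""
    return [xi + strength * d for xi, d in zip(x, w_dec[latent])]


def zero_ablate(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix, b_dec: Vector, latents: Set[int]) -> Vector:
    """Set the chosen latents to zero while keeping the SAE's reconstruction
    error, so only the ablated latents' contribution leaves the activation."""
    z = encode(x, w_enc, b_enc)
    error = [xi - ri for xi, ri in zip(x, decode(z, w_dec, b_dec))]
    z_ablated = [0.0 if i in latents else zi for i, zi in enumerate(z)]
    return [ri + ei for ri, ei in zip(decode(z_ablated, w_dec, b_dec), error)]


# Hypothetical 2-dimensional activation space with 2 latents.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]
B_DEC = [0.0, 0.0]
```

With these toy weights, `steer([1.0, 2.0], W_DEC, 0, 3.0)` gives `[4.0, 2.0]`, and `zero_ablate([-1.0, 2.0], W_ENC, B_ENC, W_DEC, B_DEC, {1})` gives `[-1.0, 0.0]`: the contribution of latent 1 is removed, while the component the SAE cannot reconstruct (the negative first coordinate) is left intact. In a real model both operations would be applied to residual-stream activations at every generated token, which this sketch does not attempt.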
Problem

Research questions and friction points this paper is trying to address.

Endogenous Steering Resistance
activation steering
language models
internal consistency
adversarial manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Endogenous Steering Resistance
activation steering
sparse autoencoder
self-correction
internal consistency-checking
Authors

Alex McKenzie
AE Studio
AI Safety, AI Alignment, Mechanistic Interpretability, AI Safety Evaluations
Keenan Pepper
AE Studio
Stijn Servaes
AE Studio
Martin Leitgab
Unknown affiliation
Murat Cubuktepe
University of Texas at Austin
Formal Methods, Convex Optimization, Reinforcement Learning, Robotics
Mike Vaiana
AE Studio
Diogo de Lucena
AE Studio
Judd Rosenblatt
AE Studio
Michael S. A. Graziano
Princeton Neuroscience Institute & Department of Psychology, Princeton University