Endogenous Resistance to Activation Steering in Language Models

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study names and characterizes Endogenous Steering Resistance (ESR): large language models can resist task-misaligned activation steering, sometimes recovering mid-generation to produce an improved response even while the steering remains active. Using sparse autoencoder (SAE) latents to steer activations, the authors find substantial ESR in Llama-3.3-70B and less frequent ESR in smaller Llama-3 and Gemma-2 models. They identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B; zero-ablating these latents reduces the multi-attempt rate by 25%, which the authors present as causal evidence for dedicated internal consistency-checking circuits. ESR can also be strengthened deliberately: meta-prompts that instruct the model to self-monitor raise the multi-attempt rate fourfold for Llama-3.3-70B, and fine-tuning on self-correction examples induces ESR-like behavior in smaller models. The authors note that ESR could protect against adversarial manipulation but might also interfere with safety interventions that rely on activation steering.

📝 Abstract

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
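The two interventions named in the abstract, steering with an SAE latent and zero-ablating a set of latents, can be illustrated with a toy sketch. This is not the authors' code (their repository is linked above). It assumes a plain ReLU sparse autoencoder with random weights, and every name here (`sae_encode`, `sae_decode`, `steer`, `zero_ablate`, the dimensions, the latent indices, the steering strength) is hypothetical and chosen only for illustration. Normalising the decoder direction and preserving the SAE reconstruction error during ablation are common conventions, assumed here rather than taken from the paper.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Toy ReLU SAE encoder: activations -> non-negative latents."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def sae_decode(z, W_dec, b_dec):
    """Toy SAE decoder: latents -> reconstructed activations."""
    return z @ W_dec + b_dec


def steer(x, W_dec, latent_idx, strength):
    """Add a scaled, unit-norm decoder direction to every token position."""
    direction = W_dec[latent_idx]
    direction = direction / np.linalg.norm(direction)
    return x + strength * direction


def zero_ablate(x, W_enc, b_enc, W_dec, b_dec, latent_ids):
    """Set the chosen latents to zero while keeping the SAE's
    reconstruction error, so only those latents' contribution changes."""
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)
    z_ablated = z.copy()
    z_ablated[:, latent_ids] = 0.0
    return sae_decode(z_ablated, W_dec, b_dec) + error


# Hypothetical toy sizes and random weights; a real SAE is trained.
rng = np.random.default_rng(0)
n_tokens, d_model, n_latents = 5, 16, 64
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = rng.normal(size=n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = rng.normal(size=d_model)
x = rng.normal(size=(n_tokens, d_model))  # stand-in for model activations

steer_idx = 2            # hypothetical latent used for steering
ablate_ids = [3, 7, 11]  # hypothetical latents to zero-ablate

x_steered = steer(x, W_dec, steer_idx, strength=4.0)
x_ablated = zero_ablate(x, W_enc, b_enc, W_dec, b_dec, ablate_ids)
```

In an actual experiment these functions would run inside a forward hook at the layer the SAE was trained on, with the modified activations passed on to the rest of the model. The sketch only shows the arithmetic applied to the activations.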

Problem

Research questions and friction points this paper is trying to address.

Endogenous Steering Resistance

activation steering

language models

internal consistency

adversarial manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Endogenous Steering Resistance

activation steering

sparse autoencoder