Steered LLM Activations are Non-Surjective

📅 2026-04-10

🤖 AI Summary
This study investigates whether internal states of a language model, after activation intervention, can be reproduced through forward passes induced by arbitrary natural-language prompts. Formalizing this question as the surjectivity of prompt-induced state manifolds onto intervened states, we theoretically prove for the first time that activation interventions typically displace the residual stream outside the manifold of states reachable by any natural prompt, thereby rigorously delineating the capability boundaries between white-box interventions and black-box prompting. Leveraging the manifold hypothesis and probabilistic arguments, we empirically validate this theory across three prominent large language models, demonstrating that nearly all intervened states are irreproducible by any natural prompt. These findings indicate that successful activation interventions do not imply the existence of corresponding interpretable prompts or model vulnerabilities, revealing a fundamental separation between intervention and prompting mechanisms.

📝 Abstract
Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
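
The surjectivity question can be made concrete with a toy sketch. This is only an illustration of the counting intuition (finitely many prompts give finitely many states, so a continuously shifted state generically matches none of them). It is not the paper's proof or experimental setup. The recurrent "model" `toy_forward`, the helper names, and all sizes below are hypothetical stand-ins, not anything taken from the paper.

```python
import itertools
import numpy as np

def toy_forward(tokens, emb, W):
    """Hypothetical stand-in for a forward pass: token tuple -> d-dim state."""
    h = np.zeros(emb.shape[1])
    for t in tokens:
        h = np.tanh(W @ h + emb[t])
    return h

def reachable_states(vocab_size, max_len, emb, W):
    """Enumerate every prompt up to max_len and collect the induced states."""
    prompts = [p for n in range(1, max_len + 1)
               for p in itertools.product(range(vocab_size), repeat=n)]
    return np.stack([toy_forward(p, emb, W) for p in prompts])

def steering_gap(states, base, direction, alpha):
    """Distance from the steered state base + alpha * direction
    to the nearest prompt-reachable state (0.0 means some prompt realizes it)."""
    steered = base + alpha * direction
    return float(np.min(np.linalg.norm(states - steered, axis=1)))

rng = np.random.default_rng(0)
d, vocab_size, max_len = 8, 4, 4               # toy sizes, chosen arbitrarily
emb = rng.normal(size=(vocab_size, d))         # random toy embeddings
W = rng.normal(size=(d, d)) / np.sqrt(d)       # random toy recurrence weights

states = reachable_states(vocab_size, max_len, emb, W)  # 4 + 16 + 64 + 256 = 340 prompts
base = states[0]                               # a state that some prompt does reach
v = rng.normal(size=d)
v /= np.linalg.norm(v)                         # unit-norm steering direction

for alpha in (0.0, 0.5, 8.0):
    print(alpha, steering_gap(states, base, v, alpha))
```

With `alpha = 0.0` the gap is exactly zero, because the unsteered state is reachable by construction. For nonzero `alpha` the steered state has no pre-image in this enumerated prompt set. A real LLM has far more prompts, but the set of prompts is still countable, which is the intuition the toy is meant to convey.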
Problem

Research questions and friction points this paper is trying to address.

activation steering
surjectivity
large language models
prompt realizability
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering
surjectivity
residual stream manifold
white-box control
prompt realizability