Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Full fine-tuning of language models for reasoning tasks often degrades interpretability by entangling parameter updates across layers. Method: To preserve transparency while matching performance, we inject lightweight additive steering vectors into the residual stream, bypassing full parameter updates, and analyze the resulting mechanisms via logit-lens probing, path patching, and circuit analysis. Contribution/Results: Steering vectors at the final layer primarily bias first-token generation, whereas those at the penultimate layer selectively amplify MLP and unembedding responses to keywords and structural tokens. The method matches full fine-tuning performance on two mainstream LMs with minimal computational overhead. Crucially, it provides the first interpretable framework linking intervention signals, layer-specific computational mechanisms, and downstream reasoning behavior, enabling both mechanistic understanding and controllable steering of reasoning capabilities.
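The core intervention is simple: add a trained vector to the residual stream at one layer while all model weights stay frozen. A minimal sketch of this idea on a toy residual stream (the layer function, dimensions, and the `steering` argument are illustrative stand-ins, not the paper's setup; the paper trains the vector with an RL objective, which is omitted here):

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)

def toy_layer(h):
    """Stand-in for a transformer block: any function of the residual stream."""
    return h + np.tanh(h)

def forward(h0, n_layers=4, steering=None):
    """Run the residual stream through n_layers, optionally adding a
    steering vector v after layer `at`. The base model's weights are
    untouched; the vector is the only trainable component."""
    h = h0
    for i in range(n_layers):
        h = toy_layer(h)
        if steering is not None and i == steering["at"]:
            h = h + steering["v"]  # additive, layer-local intervention
    return h

h0 = rng.normal(size=d_model)
v = 0.5 * rng.normal(size=d_model)  # would be RL-trained in the paper

base = forward(h0)
steered = forward(h0, steering={"at": 2, "v": v})
```

Because the intervention is a single additive vector at a known layer, its downstream effect can be traced with standard tools (logit lens, path patching) rather than being smeared across every weight matrix.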

📝 Abstract
The mechanisms by which reasoning training reshapes language-model computations remain poorly understood. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective, which can match full fine-tuning performance while retaining the interpretability of small, additive interventions. Using logit-lens readouts, path patching, and circuit analyses, we analyze two models and find: (i) the last-layer steering vector behaves like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; and (ii) the penultimate-layer steering vector leaves attention patterns largely unchanged and instead acts through the MLP and unembedding, preferentially up-weighting process words and structure symbols. These results establish a principled framework for interpreting the behavioral changes induced by reasoning training.
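The logit-lens readouts mentioned above project an intermediate residual state through the frozen unembedding matrix to see which tokens the model favors at that depth. A hedged sketch of the readout (the vocabulary, dimensions, and weights here are toy stand-ins, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 8, 5
tokens = ["To", "Step", "The", "We", "A"]  # illustrative vocabulary
W_U = rng.normal(size=(d_model, vocab))    # frozen unembedding matrix

def logit_lens(h):
    """Token logits from an intermediate residual state; real
    implementations apply the model's final LayerNorm before W_U."""
    h_norm = (h - h.mean()) / (h.std() + 1e-6)  # crude LayerNorm stand-in
    return h_norm @ W_U

h_mid = rng.normal(size=d_model)  # residual stream at some layer
logits = logit_lens(h_mid)
top_token = tokens[int(np.argmax(logits))]  # layer-local "prediction"
```

Comparing such readouts with and without the steering vector is what lets the paper attribute effects like the boosted "To" and "Step" first tokens to a specific layer's intervention.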
Problem

Research questions and friction points this paper is trying to address.

Mechanisms of reasoning training in language models
Interpretability of reinforcement learning steering vectors
Behavioral changes from reasoning training interventions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight steering vectors trained with reinforcement learning
Steering vectors match full fine-tuning while retaining interpretability
Logit-lens readouts and path patching to analyze model mechanisms