Reasoning-Finetuning Repurposes Latent Representations in Base Models

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether reasoning fine-tuning endows large language models with *de novo* capabilities or instead reuses pre-existing latent representations to elicit behaviors such as backtracking. Method: the authors construct steering vectors in the residual stream and combine them with attribution analysis to identify directions that are present, but dormant, in the base model, then examine how fine-tuning repurposes rather than creates these representations. Results: on DeepSeek-R1-Distill-Llama-8B, steering with a direction extracted from the base model reliably induces backtracking, even though the same direction produces no backtracking in the base model itself, which points to selective reuse. The work provides evidence that reasoning fine-tuning operates primarily through *representation reuse* and *functional redirection* rather than representation emergence, offering a mechanistic account of how fine-tuning reconfigures pre-existing representational subspaces into new behavioral circuits.

📝 Abstract
Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.
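The steering described in the abstract, extracting a direction from the base model's residual stream and adding it to the reasoning model's activations, can be sketched as below. This is a minimal illustration, not the paper's implementation: the difference-of-means extraction, the helper names (`extract_direction`, `steer`), and the scale `alpha` are all assumptions for exposition; the authors' exact extraction and injection procedure may differ.

```python
import numpy as np

def extract_direction(acts_pos, acts_neg):
    """Hypothetical difference-of-means extraction: a unit direction
    separating residual-stream activations from contexts that precede
    backtracking (acts_pos) vs. contexts that do not (acts_neg).
    Arrays have shape (n_samples, d_model)."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(resid, direction, alpha=4.0):
    """Add the scaled steering vector to every token position's
    residual-stream state; resid has shape (n_tokens, d_model)."""
    return resid + alpha * direction

# Toy demonstration with random stand-in activations.
rng = np.random.default_rng(0)
d_model = 16
acts_pos = rng.normal(1.0, 0.1, size=(32, d_model))
acts_neg = rng.normal(0.0, 0.1, size=(32, d_model))
direction = extract_direction(acts_pos, acts_neg)

resid = rng.normal(size=(5, d_model))
steered = steer(resid, direction, alpha=2.0)
```

In practice the addition would happen inside a forward hook on a chosen layer of the target (distilled) model; the key point from the paper is that `direction` comes from the *base* model's activations yet changes behavior only in the fine-tuned model.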
Problem

Research questions and friction points this paper is trying to address.

Understanding how reasoning fine-tuning repurposes latent representations
Identifying pre-existing directions in base models that induce backtracking
Exploring mechanisms behind enhanced capabilities in reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes base model latent representations
Uses steering vectors for backtracking control
Identifies pre-existing behavioral circuit directions