Transcoder Adapters for Reasoning-Model Diffing

📅 2026-02-24

🤖 AI Summary
This study investigates how reasoning training alters the internal mechanisms of large language models, focusing on changes in multilayer perceptron (MLP) computation. The authors propose "transcoder adapters", which learn a sparse, interpretable approximation of the difference in MLP computation before and after fine-tuning, and apply them to Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. On reasoning benchmarks, the adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Only about 8% of adapter features have activating examples directly related to reasoning behaviors. For one such behavior, the production of hesitation tokens (e.g., "wait"), attribution graphs trace the effect to about 2.4% of adapter features; removing these features reduces response length, often without affecting accuracy. By combining attribution graphs, feature activation examples, and ablations, the work offers insight into the internal mechanisms affected by reasoning training.
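
The "50-90% of the gains" figure can be read as a fraction of the accuracy gap between the base and reasoning models that the adapter closes. The snippet below is a minimal sketch of that reading; the function name `gain_recovered` and the example accuracies are hypothetical illustrations, not values or definitions taken from the paper.

```python
def gain_recovered(base_acc: float, adapter_acc: float, target_acc: float) -> float:
    """Fraction of the fine-tuning accuracy gain recovered by the adapter.

    Assumed definition (not confirmed by the source):
        (adapter_acc - base_acc) / (target_acc - base_acc)
    """
    gain = target_acc - base_acc
    if gain == 0:
        raise ValueError("target and base accuracy are equal; the ratio is undefined")
    return (adapter_acc - base_acc) / gain


# Illustrative numbers only: base 40%, reasoning model 80%, base + adapter 70%.
fraction = gain_recovered(base_acc=0.40, adapter_acc=0.70, target_acc=0.80)
print(f"recovered {fraction:.0%} of the gain")
```

Under this reading, a value of 1.0 means the adapter fully matches the reasoning model on that benchmark, and 0.0 means it adds nothing over the base model.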

📝 Abstract
While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
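
To make the idea in the abstract concrete, here is a minimal NumPy sketch of one way a transcoder adapter could be structured: the frozen base MLP output plus a sparse correction that approximates the fine-tuned MLP. Everything specific here is an assumption for illustration and not the authors' implementation: the class name `TranscoderAdapter`, the ReLU encoder, the L1 sparsity penalty, the initialization scale, and the `ablate` argument (which mimics removing a set of features, such as the hesitation features).

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


class TranscoderAdapter:
    """Hypothetical sketch: base MLP output + sparse, feature-based correction."""

    def __init__(self, d_model, n_features, rng):
        # Encoder maps the MLP input to feature activations (assumed ReLU encoder).
        self.W_enc = rng.normal(0.0, 0.02, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        # Decoder maps feature activations to a correction in MLP-output space.
        self.W_dec = rng.normal(0.0, 0.02, (n_features, d_model))

    def features(self, x, ablate=None):
        f = relu(x @ self.W_enc + self.b_enc)
        if ablate is not None:
            # Zero out the chosen feature indices to simulate removing them.
            f[..., ablate] = 0.0
        return f

    def __call__(self, x, base_mlp_out, ablate=None):
        # Approximate the fine-tuned MLP as: base output + learned difference.
        return base_mlp_out + self.features(x, ablate) @ self.W_dec


def adapter_loss(adapter, x, base_out, target_out, l1_coeff=1e-3):
    """Reconstruction error against the fine-tuned MLP plus an L1 sparsity term."""
    f = adapter.features(x)
    recon = base_out + f @ adapter.W_dec
    mse = np.mean((recon - target_out) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f).sum(axis=-1))
    return mse + sparsity


# Usage with random stand-ins for real activations (shapes are illustrative).
rng = np.random.default_rng(0)
adapter = TranscoderAdapter(d_model=8, n_features=32, rng=rng)
x = rng.normal(size=(4, 8))           # MLP inputs for 4 token positions
base_out = rng.normal(size=(4, 8))    # base-model MLP outputs
target_out = rng.normal(size=(4, 8))  # fine-tuned-model MLP outputs
print(adapter_loss(adapter, x, base_out, target_out))
```

In this sketch, ablating every feature returns the base MLP output unchanged, which is the property that makes such an adapter a description of the difference between the two models rather than of either model alone.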
Problem

Research questions and friction points this paper is trying to address.

reasoning models
fine-tuning
internal mechanisms
model diffing
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

transcoder adapters
reasoning models
interpretable features
model diffing
fine-tuning analysis