🤖 AI Summary
This study investigates how reasoning training alters the internal mechanisms of large language models, focusing on changes in multilayer perceptron (MLP) computations. The authors propose "transcoder adapters," which use sparse, interpretable features to approximate the difference in MLP behavior before and after fine-tuning, and validate the method on Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. The approach shows that reasoning fine-tuning shapes specific internal behaviors such as hesitation-token generation: only about 2.4% of adapter features drive this behavior, and ablating them significantly shortens responses, often without compromising accuracy. Moreover, the adapters recover 50–90% of the reasoning accuracy gains and closely match the reasoning model's response lengths. Combining attribution graphs, feature activation tracing, and ablation studies, this work offers a new window into the internal mechanisms underlying reasoning training.
📝 Abstract
While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
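To make the core idea concrete, here is a minimal sketch of how a transcoder adapter could be structured: a sparse ReLU dictionary whose output is trained to match the *difference* between the fine-tuned and base MLP outputs, so that the base MLP plus the adapter approximates the fine-tuned MLP. All names, dimensions, and the placeholder `base_mlp` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_adapter = 16, 64  # hypothetical hidden and dictionary sizes

def base_mlp(x):
    # Placeholder for the frozen base model's MLP output at one layer.
    return np.tanh(x)

# Adapter weights; in practice these would be trained so that
# adapter(x) ≈ finetuned_mlp(x) - base_mlp(x), with a sparsity penalty
# on the feature activations f.
W_enc = rng.normal(scale=0.1, size=(d_adapter, d_model))
W_dec = rng.normal(scale=0.1, size=(d_model, d_adapter))

def adapter(x):
    f = np.maximum(0.0, W_enc @ x)   # sparse, interpretable features
    return W_dec @ f, f

def approx_finetuned_mlp(x):
    # Interpretable approximation of the fine-tuned MLP:
    # frozen base computation + learned sparse difference.
    delta, _ = adapter(x)
    return base_mlp(x) + delta

x = rng.normal(size=d_model)
y, feats = adapter(x)
```

Ablation studies like the hesitation-feature experiment then amount to zeroing selected entries of `f` before decoding and observing the change in model behavior.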