🤖 AI Summary
This work addresses the challenge of cost-effectively eliciting reasoning capabilities in large language models (LLMs). We propose SAE-Tuning: first, a sparse autoencoder (SAE) is trained to capture interpretable reasoning representations from a source model; then, these representations guide a lightweight supervised fine-tuning of a target model, requiring only verified question-answer data and no costly reasoning-trace annotations. To our knowledge, this is the first method enabling modular reuse and plug-and-play cross-model transfer of reasoning capabilities. On AIME24 and AMC23, it achieves 43.33% and 90% Pass@1, respectively, while preserving over 97% of the reasoning performance of the RL-trained counterpart. Training costs drop to roughly $1 and 20 minutes, over 2,000× cheaper and 450× faster than the RL baseline. All code and models are publicly released.
📝 Abstract
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
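To make the SAE-Tuning pipeline's first step concrete, here is a minimal conceptual sketch of the sparse autoencoder itself: an overcomplete encoder maps a model's hidden activation to a mostly-zero feature vector, which a decoder then reconstructs. This is a toy numpy illustration under assumptions, not the paper's implementation; the dimensions (`d_model`, `d_sae`), the top-k sparsity rule, and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 64, 256, 8   # hidden size, SAE width, active features (assumed values)

# Randomly initialized SAE parameters; in practice these are trained to
# minimize reconstruction error on the source model's activations.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(h):
    """Encode with ReLU, keep only the top-k features, then decode."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    drop = np.argsort(z)[:-k]                # indices of all but the k largest
    z_sparse = z.copy()
    z_sparse[drop] = 0.0                     # enforce k-sparsity
    h_hat = z_sparse @ W_dec + b_dec         # reconstruct the activation
    return z_sparse, h_hat

h = rng.normal(size=d_model)                 # stand-in for a residual-stream activation
z, h_hat = sae_forward(h)
```

In the paper's procedure, features like `z` extracted from the source model would then serve as the guidance signal during supervised fine-tuning of the target model; the sketch above only covers the encode/decode step that produces those features.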