🤖 AI Summary
This work addresses the challenge of cost-effectively eliciting reasoning capabilities in large language models (LLMs). We propose SAE-Tuning: first, a sparse autoencoder (SAE) is trained to capture interpretable reasoning representations from a source model; then, these representations guide a lightweight supervised fine-tuning of a target model, requiring only verified question-answer data and no costly reasoning-trace annotations. To our knowledge, this is the first method enabling modular reuse and plug-and-play cross-model transfer of reasoning capabilities. On AIME24 and AMC23, it achieves 43.33% and 90% Pass@1, respectively, while preserving over 97% of the reasoning performance of the RL-trained counterpart. Training costs drop to roughly $1 and 20 minutes, over 2,000× cheaper and 450× faster than the RL baseline. All code and models are publicly released.
📝 Abstract
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
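To make the SAE-Tuning pipeline's first step concrete, here is a minimal conceptual sketch of the sparse autoencoder itself: an overcomplete encoder maps a model's hidden activation to a mostly-zero feature vector, which a decoder then reconstructs. This is a toy numpy illustration under assumptions, not the paper's implementation; the dimensions (`d_model`, `d_sae`), the top-k sparsity rule, and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 64, 256, 8   # hidden size, SAE width, active features (assumed values)

# Randomly initialized SAE parameters; in practice these are trained to
# minimize reconstruction error on the source model's activations.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(h):
    """Encode with ReLU, keep only the top-k features, then decode."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    drop = np.argsort(z)[:-k]                # indices of all but the k largest
    z_sparse = z.copy()
    z_sparse[drop] = 0.0                     # enforce k-sparsity
    h_hat = z_sparse @ W_dec + b_dec         # reconstruct the activation
    return z_sparse, h_hat

h = rng.normal(size=d_model)                 # stand-in for a residual-stream activation
z, h_hat = sae_forward(h)
```

In the paper's procedure, features like `z` extracted from the source model would then serve as the guidance signal during supervised fine-tuning of the target model; the sketch above only covers the encode/decode step that produces those features.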