How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

📅 2026-03-23

🤖 AI Summary
This work addresses the performance degradation that student models suffer in supervised fine-tuning when synthetic data generated by a strong teacher model does not match the student's stylistic distribution. To mitigate this issue, the authors propose TESSY, a framework that brings teacher-student cooperation into the data synthesis process. TESSY interleaves the two models so that they alternately generate style and non-style tokens, combining the teacher's advanced reasoning capabilities with the student's linguistic style in a single training sequence. In code-generation experiments with GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on TESSY data improves performance by 11.25% on LiveCodeBench-Pro and 6.68% on OJBench, whereas fine-tuning on teacher-generated data lowers it by 3.25% and 10.02%, respectively.

📝 Abstract
A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%, respectively.
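
The interleaving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how style tokens are detected or how control passes between the two models, so the sketch assumes that the teacher proposes every next token and that the student takes over whenever the proposal is a style token. All names here (`synthesize`, `toy_teacher`, `toy_student`, `toy_is_style`, `STYLE_MARKERS`) are hypothetical, and the "models" are scripted stand-ins rather than real language models.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]
EOS = "<eos>"


def synthesize(
    prompt: List[Token],
    teacher_next: NextToken,
    student_next: NextToken,
    is_style: Callable[[Token], bool],
    max_tokens: int = 64,
) -> List[Token]:
    """Build one sequence by letting two models share the decoding loop.

    Assumption (not from the paper): the teacher proposes each token, and
    the student writes the token instead whenever the proposal is stylistic.
    """
    context = list(prompt)
    generated: List[Token] = []
    for _ in range(max_tokens):
        proposal = teacher_next(context)
        # Style positions go to the student, everything else to the teacher.
        token = student_next(context) if is_style(proposal) else proposal
        if token == EOS:
            break
        context.append(token)
        generated.append(token)
    return generated


# Toy stand-ins for the two models (purely illustrative).
PROMPT = ["Sort", "a", "list."]
TEACHER_SCRIPT = ["Hmm,", "use", "merge", "sort", "Wait,",
                  "check", "empty", "input", EOS]
STYLE_MARKERS = {"Hmm,", "Wait,"}


def toy_teacher(context: List[Token]) -> Token:
    # Replays a fixed script, one token per decoding step.
    step = len(context) - len(PROMPT)
    return TEACHER_SCRIPT[min(step, len(TEACHER_SCRIPT) - 1)]


def toy_student(context: List[Token]) -> Token:
    # The student's own discourse markers: "Okay," to open, "Also," later.
    return "Okay," if len(context) == len(PROMPT) else "Also,"


def toy_is_style(token: Token) -> bool:
    return token in STYLE_MARKERS


sample = synthesize(PROMPT, toy_teacher, toy_student, toy_is_style)
print(" ".join(sample))  # prints: Okay, use merge sort Also, check empty input
```

In this toy run the teacher's reasoning steps survive unchanged while its discourse markers are replaced by the student's, which is the property the abstract attributes to TESSY-generated sequences. A real pipeline would replace the scripted functions with actual model decoding and a principled style detector.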
Problem

Research questions and friction points this paper is trying to address.

reasoning model
supervised fine-tuning
synthetic data
stylistic divergence
teacher-student framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-Student Cooperation
Synthetic Data
Stylistic Consistency
Supervised Fine-Tuning
Reasoning Models