Towards Widening The Distillation Bottleneck for Reasoning Models

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the learning difficulties and "overthinking" bias that arise when distilling large reasoning models (LRMs) with chain-of-thought (CoT) reasoning into smaller student models. Methodologically, the authors propose a tree-structured CoT data construction and collaborative training paradigm: (1) a Monte Carlo Tree Search (MCTS)-based framework for generating hierarchical, branching CoT trajectories; (2) a Thoughts Length Balance mechanism to mitigate length-induced bias in reasoning depth; and (3) a joint post-training objective that integrates fine-grained Direct Preference Optimization (DPO) with supervised fine-tuning (SFT). Compared with conventional SFT- or RL-based distillation, this approach significantly reduces student-model overfitting to redundant reasoning steps. Empirically, it achieves near-teacher performance on both mathematical and general reasoning benchmarks, demonstrating substantial improvements in distillation efficiency and out-of-distribution generalization.
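The joint post-training objective described above can be sketched as a weighted combination of an SFT negative log-likelihood and an averaged DPO pairwise loss. This is an illustrative reconstruction, not the paper's implementation: the weighting `lam` and the DPO temperature `beta` are assumed hyperparameters.

```python
import math

def dpo_term(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    # Standard DPO loss on one (winner, loser) preference pair:
    # -log sigmoid(beta * margin), where the margin compares policy vs.
    # reference log-probabilities of the two responses.
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def joint_objective(sft_nll, pair_losses, lam=0.5):
    # Hypothetical joint objective: SFT negative log-likelihood plus a
    # weighted average of fine-grained DPO losses over preference pairs
    # (e.g., pairs of tree branches of differing quality).
    dpo = sum(pair_losses) / len(pair_losses)
    return sft_nll + lam * dpo
```

With `margin = 0` the DPO term reduces to `log 2`, so an untrained policy that matches the reference contributes a constant preference loss on top of the SFT term.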

📝 Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation (post-training on LRM-generated data) is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we found that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases (i.e., over-thinking) under Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the constructed data.
Problem

Research questions and friction points this paper is trying to address.

Distilled long CoT data causes learning difficulties in small models.
Inheritance of biases like over-thinking in SFT and RL methods.
How to construct tree-based CoT data via MCTS that small models can learn from effectively.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monte Carlo Tree Search for CoT data
Thoughts Length Balance technique
Fine-grained DPO and Joint Post-training
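The MCTS-based CoT construction listed above can be illustrated with a minimal, generic UCT skeleton. The `thought` payload, the reward signal, and the exploration constant `c` are illustrative assumptions; the paper's actual tree construction and scoring are not specified here.

```python
import math

class Node:
    def __init__(self, thought, parent=None):
        self.thought = thought   # one reasoning step (a CoT fragment)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0         # accumulated rollout reward

    def ucb(self, c=1.4):
        # UCT score: exploit the average value, explore rarely-visited branches.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def select(root):
    # Walk down the tree, always following the child with the best UCB score.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.ucb())
    return node

def backpropagate(node, reward):
    # Propagate a rollout reward from a leaf back to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

Repeating select → expand → evaluate → backpropagate yields a branching tree of reasoning trajectories, from which preference pairs (stronger vs. weaker branches) can later be extracted for fine-grained DPO.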
Huifeng Yin
Alibaba International Digital Commerce, Tsinghua University
Yu Zhao
Alibaba International Digital Commerce
Minghao Wu
Alibaba International Digital Commerce, Monash University
Xuanfan Ni
Alibaba International Digital Commerce
Bo Zeng
University of Pittsburgh
Hao Wang
Alibaba International Digital Commerce
Tianqi Shi
Alibaba International Digital Commerce
Liangying Shao
Alibaba International Digital Commerce
Chenyang Lyu
Alibaba
Large Language Models, Natural Language Processing, Machine Learning
Longyue Wang
Alibaba International
Large Language Models, Machine Translation, Natural Language Processing, Language Agents
Weihua Luo
Alibaba
natural language processing, machine learning, artificial intelligence
Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Two-sided markets, Internet platforms, e-commerce