🤖 AI Summary
This work addresses the learning difficulties and "overthinking" bias encountered when distilling large reasoning models (LRMs) with chain-of-thought (CoT) reasoning into smaller student models. Methodologically, we propose a tree-structured CoT data construction and collaborative training paradigm: (1) a Monte Carlo Tree Search (MCTS)-based framework for generating hierarchical, branching CoT trajectories; (2) a Thoughts Length Balance mechanism to mitigate length-induced bias in reasoning depth; and (3) a joint post-training objective integrating fine-grained Direct Preference Optimization (DPO) with supervised fine-tuning (SFT). Compared to conventional SFT- or RL-based distillation, our approach significantly reduces student-model overfitting to redundant reasoning steps. Empirically, it achieves near-teacher performance on both mathematical and general reasoning benchmarks, demonstrating substantial improvements in distillation efficiency and out-of-distribution generalization.
📝 Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT) traces. Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we find that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases (i.e., overthinking) under both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL). To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the constructed data.
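The MCTS-based tree construction can be sketched roughly as follows. This is an illustrative skeleton, not the paper's implementation: `propose_thoughts` and `score_trajectory` are hypothetical stand-ins for the LRM step-proposal and the verifier reward, which the abstract does not specify.

```python
import math
import random


class Node:
    """A node in the CoT tree: one reasoning step (a 'thought')."""

    def __init__(self, thought, parent=None):
        self.thought = thought
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated reward from simulations

    def uct(self, c=1.4):
        # Upper Confidence Bound for Trees: trade off exploitation
        # (mean reward) against exploration (rarely visited branches).
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def propose_thoughts(node, k=2):
    # Hypothetical placeholder for an LRM call that proposes k
    # candidate next reasoning steps given the partial trajectory.
    return [f"{node.thought}.{i}" for i in range(k)]


def score_trajectory(node):
    # Hypothetical placeholder reward, e.g. an answer-correctness
    # check by a verifier on the completed trajectory.
    return random.random()


def mcts(root, iterations=100):
    for _ in range(iterations):
        # 1) Selection: descend via UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # 2) Expansion: attach candidate next thoughts to a visited leaf.
        if node.visits > 0:
            for t in propose_thoughts(node):
                node.children.append(Node(t, parent=node))
            node = node.children[0]
        # 3) Simulation: score the (partial) trajectory.
        reward = score_trajectory(node)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root
```

After the search, high-reward root-to-leaf paths give the tree-structured CoT training data, while sibling branches naturally yield the preference pairs that fine-grained DPO consumes.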