🤖 AI Summary
This work addresses the learning difficulties and "overthinking" bias encountered when distilling large reasoning models (LRMs) with chain-of-thought (CoT) reasoning into smaller student models. Methodologically, we propose a tree-structured CoT data construction and collaborative training paradigm: (1) a Monte Carlo Tree Search (MCTS)-based framework for generating hierarchical, branching CoT trajectories; (2) a Thoughts Length Balance mechanism to mitigate length-induced bias in reasoning depth; and (3) a joint post-training objective integrating fine-grained Direct Preference Optimization (DPO) with supervised fine-tuning (SFT). Compared to conventional SFT- or RL-based distillation, our approach significantly reduces student-model overfitting to redundant reasoning steps. Empirically, it achieves near-teacher performance on both mathematical and general reasoning benchmarks, demonstrating substantial improvements in distillation efficiency and out-of-distribution generalization.
📝 Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT) traces. Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we find that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases (i.e., overthinking) under both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL). To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the constructed data.
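The MCTS-based tree construction can be sketched roughly as follows. This is an illustrative skeleton, not the paper's implementation: `propose_thoughts` and `score_trajectory` are hypothetical stand-ins for the LRM step-proposal and the verifier reward, which the abstract does not specify.

```python
import math
import random


class Node:
    """A node in the CoT tree: one reasoning step (a 'thought')."""

    def __init__(self, thought, parent=None):
        self.thought = thought
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated reward from simulations

    def uct(self, c=1.4):
        # Upper Confidence Bound for Trees: trade off exploitation
        # (mean reward) against exploration (rarely visited branches).
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def propose_thoughts(node, k=2):
    # Hypothetical placeholder for an LRM call that proposes k
    # candidate next reasoning steps given the partial trajectory.
    return [f"{node.thought}.{i}" for i in range(k)]


def score_trajectory(node):
    # Hypothetical placeholder reward, e.g. an answer-correctness
    # check by a verifier on the completed trajectory.
    return random.random()


def mcts(root, iterations=100):
    for _ in range(iterations):
        # 1) Selection: descend via UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # 2) Expansion: attach candidate next thoughts to a visited leaf.
        if node.visits > 0:
            for t in propose_thoughts(node):
                node.children.append(Node(t, parent=node))
            node = node.children[0]
        # 3) Simulation: score the (partial) trajectory.
        reward = score_trajectory(node)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root
```

After the search, high-reward root-to-leaf paths give the tree-structured CoT training data, while sibling branches naturally yield the preference pairs that fine-grained DPO consumes.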