Small Models Struggle to Learn from Strong Reasoners

📅 2025-02-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Small language models (≤3B parameters) exhibit a significant learnability gap when distilling long chain-of-thought (CoT) reasoning from large models: they struggle to internalize complex reasoning patterns. This work is the first to identify and systematically analyze this phenomenon. The authors propose Mix Distillation, a complexity-aware distillation framework that matches small-model capacity by mixing short and long CoT samples drawn from multiple teacher models according to reasoning complexity. The method integrates CoT distillation, multi-source mixed sampling, supervised fine-tuning, and cross-scale collaborative training, departing from conventional unidirectional strong-to-weak knowledge transfer. On benchmarks including GSM8K and MMLU, Mix Distillation improves small-model reasoning accuracy by 4.2–7.8 percentage points over baselines trained on only long or only short CoT chains, demonstrating both the effectiveness and the generalizability of complexity-aware distillation.
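The core idea of Mix Distillation is to build a fine-tuning set that blends short and long CoT examples rather than using either alone. A minimal sketch of that mixing step is below; the function name, the `long_fraction` default, and sampling with replacement are illustrative assumptions, not the paper's exact recipe (the paper tunes the mixing ratio to the small model's capacity).

```python
import random

def mix_cot_data(long_cot, short_cot, long_fraction=0.2, n_total=1000, seed=0):
    """Blend long- and short-CoT training examples at a fixed ratio.

    Illustrative sketch of complexity-aware data mixing; long_fraction
    is a hypothetical hyperparameter controlling how much long-CoT data
    the small model sees.
    """
    rng = random.Random(seed)
    n_long = int(n_total * long_fraction)
    n_short = n_total - n_long
    # Sample with replacement so small source pools can still fill the quota.
    mixed = [rng.choice(long_cot) for _ in range(n_long)]
    mixed += [rng.choice(short_cot) for _ in range(n_short)]
    rng.shuffle(mixed)  # interleave so batches see both complexity levels
    return mixed
```

The resulting list would then be fed to standard supervised fine-tuning; only the composition of the data changes, not the training objective.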

๐Ÿ“ Abstract
Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
Problem

Research questions and friction points this paper is trying to address.

Small models fail to learn from long reasoning chains.
Mix Distillation improves small-model reasoning.
Adapting reasoning complexity enhances capability transfer.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mix Distillation strategy
Balancing reasoning complexity
Improving small model performance