🤖 AI Summary
Small language models (SLMs) suffer from a pronounced "learnability gap" when acquiring long chain-of-thought (CoT) reasoning, owing to their limited capacity. To address this, we propose MiCoTA, a framework that introduces intermediate-sized models as teacher assistants and uses medium-length CoT rationales as distillation intermediaries, bridging the mismatch between model capacity and reasoning depth. Methodologically, MiCoTA combines distribution-aligned data selection, multi-stage knowledge distillation, and intermediate-length CoT distillation to improve knowledge transfer. Evaluated on mathematical reasoning benchmarks including AIME2024, AMC, Olympiad, MATH-500, and GSM8K, MiCoTA yields average improvements of +3.47 and +3.93 points for Qwen2.5-7B and Qwen2.5-3B, respectively, substantially strengthening their long-form reasoning. Our core contribution is identifying and mitigating the long-CoT learnability bottleneck in SLMs, pointing toward efficient, lightweight reasoning models.
📝 Abstract
Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands make widespread deployment impractical. Small language models (SLMs) offer a lightweight alternative, but they often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce **Mi**d-**Co**T **T**eacher **A**ssistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity gap and the reasoning-length gap. Our experiments on downstream tasks demonstrate that although SLMs distilled directly from large teachers can perform poorly, applying MiCoTA yields significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve improvements of 3.47 and 3.93 points, respectively, in average score across the AIME2024, AMC, Olympiad, MATH-500, and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.
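The intermediate-length CoT idea can be sketched as a simple length-band filter over teacher-assistant rationales before supervised fine-tuning. This is a hypothetical illustration, not the paper's actual pipeline: the helper name, the whitespace token proxy, and the band thresholds are all assumptions (a real implementation would use the student model's tokenizer and tuned bounds).

```python
def select_medium_cot(rationales, min_tokens=256, max_tokens=1024):
    """Keep rationales whose length falls in a medium band.

    Hypothetical stand-in for MiCoTA-style intermediate-length CoT
    selection. Whitespace splitting is a crude token-count proxy;
    the thresholds are illustrative, not from the paper.
    """
    selected = []
    for text in rationales:
        n = len(text.split())
        if min_tokens <= n <= max_tokens:
            selected.append(text)
    return selected
```

In this picture, rationales that are too short carry little reasoning signal, while very long teacher rationales exceed what a small student can absorb; the filtered set would then serve as the distillation corpus for fine-tuning the SLM.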