Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates efficient distillation of large language models’ (LLMs) chain-of-thought (CoT) reasoning capabilities into small language models (SLMs), balancing computational efficiency and reasoning performance. We conduct large-scale controlled experiments across seven mathematical and commonsense reasoning benchmarks, systematically varying four teacher LLMs, seven student architectures, CoT granularity levels, CoT formatting strategies, and teacher selection criteria. Key findings reveal: (i) a non-monotonic granularity effect in CoT distillation—neither finest nor coarsest granularities yield optimal performance; (ii) minimal impact of CoT formatting on student outcomes; and (iii) no positive correlation between teacher model strength and student performance, highlighting the need to balance teacher diversity and reasoning complexity. Based on these insights, we propose a student-adaptive CoT distillation strategy that significantly improves SLM generalization and stability across multi-task reasoning. All code and datasets are publicly released.

📝 Abstract
Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format, and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.
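To make the granularity axis concrete, here is a minimal sketch of how CoT distillation training pairs might be assembled at different granularities. The function, the step-merging rule, and the "Step N: … / Answer: …" target format are illustrative assumptions, not the paper's released pipeline.

```python
def build_distillation_example(question, teacher_steps, answer, granularity):
    """Format one supervised fine-tuning example for a student model.

    granularity = number of reasoning steps kept (1 = coarsest,
    len(teacher_steps) = finest). The paper reports that neither
    extreme is optimal for SLMs.
    """
    kept = max(1, min(granularity, len(teacher_steps)))
    if kept == len(teacher_steps):
        steps = teacher_steps
    else:
        # Naive coarsening: fold the surplus steps into the last kept slot.
        head = teacher_steps[:kept - 1]
        tail = " ".join(teacher_steps[kept - 1:])
        steps = head + [tail]
    cot = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return {"input": question, "target": f"{cot}\nAnswer: {answer}"}

# A teacher rationale at full granularity (3 steps), coarsened to 2 steps.
example = build_distillation_example(
    "Tom has 3 boxes of 4 apples. How many apples?",
    ["There are 3 boxes.", "Each box has 4 apples.", "3 * 4 = 12."],
    "12",
    granularity=2,
)
```

Sweeping `granularity` over such examples, fine-tuning the student on each variant, and comparing held-out accuracy is one way to reproduce the non-monotonic effect the abstract describes.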
Problem

Research questions and friction points this paper is trying to address.

Identify the key factors that drive effective CoT distillation.
Examine the effects of CoT granularity, format, and teacher model choice.
Optimize CoT supervision strategies for small language models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale controlled study of distilling CoT reasoning into SLMs.
Systematic comparison of CoT granularity, format, and teacher selection.
Student-adaptive tailoring of CoT distillation strategies.
Xinghao Chen
Department of Computing, The Hong Kong Polytechnic University; Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
Zhijing Sun
Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
Wenjin Guo
Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
Miaoran Zhang
Saarland University, Saarland Informatics Campus
Machine Learning, Natural Language Processing, Representation Learning
Yanjun Chen
University of Illinois Urbana-Champaign
Human-Computer Interaction, Haptics
Yirong Sun
Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
Hui Su
Meituan Inc.
Yijie Pan
Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
Dietrich Klakow
Saarland University, Saarland Informatics Campus, PharmaScienceHub
Natural Language Processing, Speech Processing, Question Answering, Machine Learning
Wenjie Li
Department of Computing, The Hong Kong Polytechnic University
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning