🤖 AI Summary
A systematic benchmark for evaluating how data operations—such as augmentation, selection, and mixing—affect student model reasoning capabilities in Chain-of-Thought (CoT) distillation remains absent.
Method: We introduce DC-CoT, the first data-centric benchmark tailored for CoT distillation, enabling unified evaluation across methods, models, and data. It supports multi-teacher setups (e.g., o4-mini, Gemini-Pro) and multi-student architectures (3B/7B), and assesses in-distribution (IID) and out-of-distribution (OOD) generalization as well as cross-domain transfer. Leveraging high-quality, multi-source CoT data, DC-CoT integrates standardized benchmarks (GSM8K, MMLU, DROP) and controlled distribution-shift experiments.
Contribution/Results: DC-CoT reveals, for the first time, the nonlinear impact of data operations on reasoning performance and identifies optimal strategies for enhancing OOD robustness. We publicly release the dataset and evaluation code to advance the practical deployment of lightweight reasoning models.
📝 Abstract
Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from the method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B and 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset is available at https://huggingface.co/datasets/rana-shahroz/DC-COT, and our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.
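To make the two data operations that need no teacher queries concrete, here is a minimal, illustrative sketch of rationale-length selection and weighted multi-teacher mixing over CoT training examples. The `CoTExample` class and all function names are our own illustrative assumptions, not the DC-CoT API; augmentation is omitted because it requires calling a teacher model.

```python
import random
from dataclasses import dataclass

@dataclass
class CoTExample:
    """One distillation training example (hypothetical schema, not the DC-CoT format)."""
    question: str
    rationale: str  # the teacher's chain of thought
    answer: str
    teacher: str    # e.g., "o4-mini", "gemini-pro"

def select_by_length(examples, max_rationale_tokens=128):
    """Data selection: keep examples whose rationale fits a token budget,
    using whitespace splitting as a rough proxy for tokenization."""
    return [ex for ex in examples
            if len(ex.rationale.split()) <= max_rationale_tokens]

def mix_teachers(pools, weights, n, seed=0):
    """Data mixing: draw n training examples from per-teacher pools
    according to the given mixing weights (e.g., 70% o4-mini, 30% Gemini)."""
    rng = random.Random(seed)
    teachers = list(pools)
    w = [weights[t] for t in teachers]
    return [rng.choice(pools[rng.choices(teachers, weights=w)[0]])
            for _ in range(n)]
```

A study would then sweep the selection threshold and mixing weights, fine-tune the student on each resulting set, and compare IID versus OOD accuracy.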