🤖 AI Summary
A systematic benchmark for evaluating how data operations—such as augmentation, selection, and mixing—affect student model reasoning capabilities in Chain-of-Thought (CoT) distillation remains absent.
Method: We introduce DC-CoT, the first data-centric benchmark tailored for CoT distillation, enabling unified evaluation across methods, models, and data. It supports multi-teacher setups (e.g., o4-mini, Gemini-Pro) and multi-student architectures (3B/7B), and assesses in-distribution (IID) and out-of-distribution (OOD) generalization as well as cross-domain transfer. Leveraging high-quality, multi-source CoT data, DC-CoT integrates standardized benchmarks (GSM8K, MMLU, DROP) and controlled distribution-shift experiments.
Contribution/Results: DC-CoT reveals, for the first time, the nonlinear impact of data operations on reasoning performance and identifies optimal strategies for enhancing OOD robustness. We publicly release the dataset and evaluation code to advance the practical deployment of lightweight reasoning models.
📝 Abstract
Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from the method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B and 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset is available at https://huggingface.co/datasets/rana-shahroz/DC-COT, and our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.
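To make the two data operations that need no teacher queries concrete, here is a minimal, illustrative sketch of rationale-length selection and weighted multi-teacher mixing over CoT training examples. The `CoTExample` class and all function names are our own illustrative assumptions, not the DC-CoT API; augmentation is omitted because it requires calling a teacher model.

```python
import random
from dataclasses import dataclass

@dataclass
class CoTExample:
    """One distillation training example (hypothetical schema, not the DC-CoT format)."""
    question: str
    rationale: str  # the teacher's chain of thought
    answer: str
    teacher: str    # e.g., "o4-mini", "gemini-pro"

def select_by_length(examples, max_rationale_tokens=128):
    """Data selection: keep examples whose rationale fits a token budget,
    using whitespace splitting as a rough proxy for tokenization."""
    return [ex for ex in examples
            if len(ex.rationale.split()) <= max_rationale_tokens]

def mix_teachers(pools, weights, n, seed=0):
    """Data mixing: draw n training examples from per-teacher pools
    according to the given mixing weights (e.g., 70% o4-mini, 30% Gemini)."""
    rng = random.Random(seed)
    teachers = list(pools)
    w = [weights[t] for t in teachers]
    return [rng.choice(pools[rng.choices(teachers, weights=w)[0]])
            for _ in range(n)]
```

A study would then sweep the selection threshold and mixing weights, fine-tune the student on each resulting set, and compare IID versus OOD accuracy.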