Not All Correct Answers Are Equal: Why Your Distillation Source Matters

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how the choice of teacher model affects the performance of open-source student models in reasoning-oriented knowledge distillation. Method: Using a shared corpus of 1.89 million queries, the authors construct three parallel distillation datasets from state-of-the-art teachers (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) and systematically compare output quality, showing for the first time that reasoning traces diverge significantly in quality even when the final answers are correct. They propose filtering on high-quality, verified reasoning traces rather than answer correctness alone, validated via distributional analysis, perplexity estimation, and analysis of adaptive response behavior. Contribution/Results: A student model trained on AM-Thinking-v1 distilled data sets new state-of-the-art scores on AIME2024 (84.3), AIME2025 (72.2), MATH500 (98.4), and LiveCodeBench (65.9), while adapting its response length to task difficulty.

📝 Abstract
Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
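The abstract mentions that distilled data is compared by perplexity and that only verified outputs are kept. A minimal sketch of such a filter, assuming per-token log-probabilities from some scoring model and an answer-verification flag (the function names, dict keys, and threshold below are illustrative, not from the paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity is the exponential of the negative mean token log-probability."""
    if not token_logprobs:
        return float("inf")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_traces(traces, max_ppl=20.0):
    """Keep reasoning traces that are both verified correct and low-perplexity.

    Each trace is a dict with 'logprobs' (per-token log-probs assigned by a
    scoring model) and 'correct' (result of answer verification).  The
    threshold max_ppl is a hypothetical tuning knob, not a value from the paper.
    """
    return [
        t for t in traces
        if t["correct"] and perplexity(t["logprobs"]) <= max_ppl
    ]
```

In practice the log-probabilities would come from running a language model over each candidate trace; lower perplexity indicates text the scoring model finds more predictable, which the paper correlates with better distillation quality.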
Problem

Research questions and friction points this paper is trying to address.

How does the choice of distillation source (teacher model) affect student reasoning performance?
Do datasets distilled from different teachers differ in token-length diversity and perplexity?
Can student models learn to adapt response length to task difficulty?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale empirical study (1.89M shared queries) comparing three distillation sources
Distillation from verified outputs of state-of-the-art teacher models
Evidence that AM-Thinking-v1 distilled data yields the strongest student models