Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the industrial demand for compact, efficient reasoning models, this paper extends the DistilQwen family with four distilled model series built on the Qwen models, spanning three paradigms: high-accuracy "slow-thinking" reasoning, adaptive dynamic reasoning, and reward modeling for reinforcement learning. Two key innovations are introduced: (1) an adaptive thinking mechanism that dynamically adjusts the inference strategy based on input complexity, and (2) distilled reward models that enable continual reinforcement-learning optimization grounded in teacher-model knowledge. By integrating knowledge distillation, dynamic inference control, and reward modeling, implemented end-to-end on Alibaba Cloud's PAI platform, the framework achieves scalable training and deployment. Extensive evaluations across multiple benchmarks demonstrate substantial improvements in the trade-off between inference efficiency and capability retention, and the framework has been deployed on PAI, where it actively supports large-scale industrial applications.
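The adaptive thinking mechanism described above can be illustrated with a minimal sketch. Everything here is hypothetical (the paper does not disclose its routing criterion): a toy complexity score gates each input between a cheap "fast-thinking" mode and an expensive "slow-thinking" chain-of-thought mode.

```python
def complexity_score(prompt: str) -> float:
    # Hypothetical heuristic: longer prompts and prompts containing
    # math/derivation markers are treated as harder inputs.
    markers = ("prove", "derive", "integral", "algorithm")
    score = min(len(prompt) / 500.0, 1.0)
    score += 0.5 * sum(m in prompt.lower() for m in markers)
    return score

def route(prompt: str, threshold: float = 0.6) -> str:
    # Dynamic inference control: easy inputs get a short direct answer,
    # hard inputs are routed to the long chain-of-thought mode.
    if complexity_score(prompt) >= threshold:
        return "slow-thinking"
    return "fast-thinking"
```

For example, `route("What is 2+2?")` stays in fast mode, while a prompt asking to derive and prove a result crosses the threshold. In practice such a gate would be learned (e.g. distilled from teacher behavior) rather than hand-written.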

📝 Abstract
Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
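The distilled reward models mentioned in the abstract are typically trained with a pairwise preference objective; a minimal sketch of the standard Bradley-Terry style loss on scalar scores follows (the function name and toy inputs are illustrative, not from the paper):

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the preferred
    # (e.g. teacher-endorsed) response above the rejected one.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both responses score equally, the loss is log 2; it shrinks as the chosen response's score pulls ahead. A reward model trained this way can then supply the signal for further reinforcement learning of the reasoning models.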
Problem

Research questions and friction points this paper is trying to address.

Developing small, efficient reasoning models for real-world applications
Creating distilled models balancing reasoning performance and inference speed
Providing scalable training and inference solutions for industry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled slow-thinking models for high-accuracy reasoning
Adaptive-thinking models that dynamically adjust reasoning strategies
Distilled reward models that enable further reinforcement learning
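All three distilled series build on a token-level knowledge-distillation objective; a minimal sketch is the temperature-scaled KL divergence between teacher and student next-token distributions (the temperature value and toy logits are illustrative assumptions, not the paper's training recipe):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened next-token distributions,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge; summing it over all tokens of teacher-generated reasoning traces is the usual way such student models are trained.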
Wenrui Cai
State Key Laboratory of Virtual Reality Technology and System, Beihang University
Computer Vision · Video Analysis · LLMs
Chengyu Wang
Alibaba Group
Natural Language Processing · Large Language Model · Multi-modal Learning
Junbing Yan
Alibaba Cloud Computing, Hangzhou, China
Jun Huang
Alibaba Cloud Computing, Hangzhou, China
Xiangzhong Fang
Shanghai Jiao Tong University, Shanghai, China