Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the industrial demand for compact, efficient reasoning models, this paper extends the DistilQwen family with four distilled model series built on the Qwen models, spanning three paradigms: high-accuracy "slow-thinking" reasoning, adaptive dynamic reasoning, and reward modeling for reinforcement learning. Two key innovations are introduced: (1) an adaptive thinking mechanism that dynamically adjusts the inference strategy based on input complexity, and (2) distilled reward models that enable continual reinforcement-learning optimization grounded in teacher-model knowledge. By integrating knowledge distillation, dynamic inference control, and reward modeling, implemented end-to-end on Alibaba Cloud's PAI platform, the framework achieves scalable training and deployment. Extensive evaluations across multiple benchmarks demonstrate substantial improvements in the trade-off between inference efficiency and capability retention, and the framework has been deployed on PAI, where it actively supports large-scale industrial applications.
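The adaptive thinking mechanism described above can be illustrated with a minimal sketch. Everything here is hypothetical (the paper does not disclose its routing criterion): a toy complexity score gates each input between a cheap "fast-thinking" mode and an expensive "slow-thinking" chain-of-thought mode.

```python
def complexity_score(prompt: str) -> float:
    # Hypothetical heuristic: longer prompts and prompts containing
    # math/derivation markers are treated as harder inputs.
    markers = ("prove", "derive", "integral", "algorithm")
    score = min(len(prompt) / 500.0, 1.0)
    score += 0.5 * sum(m in prompt.lower() for m in markers)
    return score

def route(prompt: str, threshold: float = 0.6) -> str:
    # Dynamic inference control: easy inputs get a short direct answer,
    # hard inputs are routed to the long chain-of-thought mode.
    if complexity_score(prompt) >= threshold:
        return "slow-thinking"
    return "fast-thinking"
```

For example, `route("What is 2+2?")` stays in fast mode, while a prompt asking to derive and prove a result crosses the threshold. In practice such a gate would be learned (e.g. distilled from teacher behavior) rather than hand-written.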

📝 Abstract
Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
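The distilled reward models mentioned in the abstract are typically trained with a pairwise preference objective; a minimal sketch of the standard Bradley-Terry style loss on scalar scores follows (the function name and toy inputs are illustrative, not from the paper):

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the preferred
    # (e.g. teacher-endorsed) response above the rejected one.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both responses score equally, the loss is log 2; it shrinks as the chosen response's score pulls ahead. A reward model trained this way can then supply the signal for further reinforcement learning of the reasoning models.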
Problem

Research questions and friction points this paper is trying to address.

Developing small, efficient reasoning models for real-world applications
Creating distilled models balancing reasoning performance and inference speed
Providing scalable training and inference solutions for industry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled slow-thinking models for high-accuracy reasoning
Adaptive-thinking models that dynamically adjust reasoning strategies
Distilled reward models that enable further reinforcement learning
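All three distilled series build on a token-level knowledge-distillation objective; a minimal sketch is the temperature-scaled KL divergence between teacher and student next-token distributions (the temperature value and toy logits are illustrative assumptions, not the paper's training recipe):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened next-token distributions,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge; summing it over all tokens of teacher-generated reasoning traces is the usual way such student models are trained.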
Wenrui Cai
State Key Laboratory of Virtual Reality Technology and System, Beihang University
Computer Vision · Video Analysis · LLMs
Chengyu Wang
Alibaba Group
Natural Language Processing · Large Language Model · Multi-modal Learning
Junbing Yan
Alibaba Cloud Computing, Hangzhou, China
Jun Huang
Alibaba Cloud Computing, Hangzhou, China
Xiangzhong Fang
Shanghai Jiao Tong University, Shanghai, China