🤖 AI Summary
To address industrial demand for compact, efficient reasoning models, this paper extends the DistilQwen family, initialized from the Qwen models, with four new model series produced by knowledge distillation. These series span three paradigmatic approaches: high-accuracy “slow-thinking” reasoning, adaptive dynamic reasoning, and reinforcement-learning–enabled reward modeling. Two key innovations are introduced: (1) an adaptive thinking mechanism that adjusts the reasoning strategy to input complexity, and (2) distilled reward models that enable continual reinforcement-learning optimization grounded in teacher-model knowledge. Implemented end-to-end on Alibaba Cloud’s PAI platform, the framework combines knowledge distillation, dynamic inference control, and reward modeling to achieve scalable training and deployment. Evaluations across multiple benchmarks demonstrate substantial improvements in the trade-off between inference efficiency and capability retention, and the models are actively supporting large-scale industrial applications on PAI.
📝 Abstract
Recently, the demand for small, efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of the distilled reward models. Finally, we show how industry practitioners can leverage these models through scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
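The abstract does not specify the distillation objective used for these model series; as background, a common token-level knowledge-distillation loss is the temperature-scaled KL divergence between teacher and student output distributions (in the style of Hinton et al.). The sketch below is a generic illustration of that objective, not the paper's actual training recipe; all function names are illustrative.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis, numerically stabilized."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD objective: KL(teacher || student) on softened
    distributions, scaled by T^2 so gradients stay comparable
    across temperatures. Shapes: (batch, vocab)."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()

# When the student matches the teacher exactly, the loss is ~0;
# any mismatch yields a positive penalty the student minimizes.
t = np.array([[2.0, 0.5, -1.0]])
print(distillation_loss(t, t))                          # ~0.0
print(distillation_loss(np.array([[0.0, 0.0, 2.0]]), t) > 0)  # True
```

In practice such a soft-label term is typically mixed with the standard cross-entropy loss on ground-truth tokens; the paper's slow-thinking and adaptive-thinking series build on distillation of this general kind.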