🤖 AI Summary
To address the high deployment cost and low inference efficiency of large language models (LLMs) in resource-constrained settings, this paper proposes a multi-agent collaborative knowledge distillation framework to develop DistilQwen2.5, a family of lightweight, open-source models. The method introduces a teacher-agent division of labor, with agents that filter, rewrite, and refine instruction-response pairs, combined with instruction tuning and fine-grained hidden-state knowledge integration via model fusion. This enables the student models to significantly surpass their original checkpoints in capability. DistilQwen2.5 consistently outperforms the original Qwen2.5 base models across multiple general-purpose and domain-specific benchmarks, achieving roughly 2.3× faster inference and 45% lower GPU memory consumption. The model weights are publicly released, and DistilQwen2.5 has been deployed and validated in multiple industrial applications, demonstrating practical viability and scalability.
📝 Abstract
Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models, owing to a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs so that they are more suitable for student LLMs to learn from. After standard fine-tuning, we further apply a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases that illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.
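The data-curation stage described above (teacher agents that select, rewrite, and refine instruction-response pairs before student fine-tuning) can be sketched as a simple pipeline. This is a hypothetical illustration only: the agent functions below are toy heuristic stand-ins, whereas in the paper each role would be played by a powerful proprietary teacher LLM.

```python
# Hypothetical sketch of the multi-agent teacher pipeline: each "agent" is a
# toy placeholder for an LLM call; the pipeline shape (filter -> rewrite ->
# refine) mirrors the division of labor described in the abstract.

def filter_agent(pair: dict) -> bool:
    """Selection agent: keep pairs whose responses are substantive enough
    for a student model to learn from (toy heuristic: word count)."""
    return len(pair["response"].split()) >= 3

def rewrite_agent(pair: dict) -> dict:
    """Rewriting agent: reshape the instruction into a cleaner form
    (toy stand-in: normalize trailing punctuation)."""
    instruction = pair["instruction"].strip().rstrip("?.")
    return {**pair, "instruction": instruction + "?"}

def refine_agent(pair: dict) -> dict:
    """Refinement agent: polish the response
    (toy stand-in: collapse redundant whitespace)."""
    return {**pair, "response": " ".join(pair["response"].split())}

def curate(pairs: list[dict]) -> list[dict]:
    """Run the three teacher agents in sequence over the raw data."""
    kept = [p for p in pairs if filter_agent(p)]
    return [refine_agent(rewrite_agent(p)) for p in kept]

raw_pairs = [
    {"instruction": "Explain overfitting",
     "response": "A model fits noise  instead of signal."},
    {"instruction": "Hi", "response": "Hello"},  # too short: filtered out
]
curated = curate(raw_pairs)
```

The curated pairs would then feed standard supervised fine-tuning of the student, followed by the paper's model-fusion stage, which operates on hidden states rather than text and is not shown here.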