Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models

📅 2025-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off in lightweight reasoning models between enhanced reasoning capability and degraded general-purpose competencies (such as instruction following, tool use, and knowledge retention), this paper proposes Ring-Lite-Distill, a reasoning model distilled from the open-source Mixture-of-Experts (MoE) LLM Ling-Lite, which activates only 2.75B parameters per token. Rather than optimizing solely for high-difficulty reasoning such as competition mathematics, the training recipe combines meticulous high-quality data curation with training paradigms that cover reasoning tasks across difficulty levels while preserving general capabilities. Experiments show that Ring-Lite-Distill matches the reasoning performance of DeepSeek-R1-Distill-Qwen-7B while substantially surpassing it in general-purpose capabilities, easing the common "strong reasoning, weak generalization" dilemma. The models are publicly released.
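For intuition about the 2.75B "activated" parameter figure, below is a minimal, generic sketch of top-k expert routing, the MoE mechanism that keeps per-token compute far below a model's total parameter count. All sizes (d_model, d_ff, n_experts, k) and the TopKMoELayer class are illustrative assumptions, not Ling-Lite's actual configuration.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Generic top-k routed MoE feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = weights.softmax(dim=-1)           # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Only k of n_experts expert FFNs run per token, so "activated" parameters
# stay well below the model's total parameter count.
x = torch.randn(4, 512)
print(TopKMoELayer()(x).shape)  # torch.Size([4, 512])
```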

📝 Abstract
This technical report presents Ring-Lite-Distill, a lightweight reasoning model derived from our open-source Mixture-of-Experts (MoE) Large Language Model (LLM), Ling-Lite. This study demonstrates that, through meticulous high-quality data curation and ingenious training paradigms, the compact MoE model Ling-Lite can be further trained to achieve exceptional reasoning capabilities while maintaining its parameter-efficient architecture with only 2.75 billion activated parameters, establishing an efficient lightweight reasoning architecture. In particular, in constructing this model, we have not merely focused on enhancing advanced reasoning capabilities, exemplified by high-difficulty mathematical problem solving, but rather aimed to develop a reasoning model with more comprehensive competency coverage. Our approach ensures coverage across reasoning tasks of varying difficulty levels while preserving generic capabilities, such as instruction following, tool use, and knowledge retention. We show that Ring-Lite-Distill's reasoning ability reaches a level comparable to DeepSeek-R1-Distill-Qwen-7B, while its general capabilities significantly surpass those of DeepSeek-R1-Distill-Qwen-7B. The models are accessible at https://huggingface.co/inclusionAI
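Since only the organization page is linked above, here is a minimal usage sketch with the Hugging Face transformers library; the repo id "inclusionAI/Ring-Lite-Distill" is a hypothetical placeholder, so substitute the actual checkpoint name from that page.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id: browse https://huggingface.co/inclusionAI
# for the actual Ring-Lite-Distill checkpoint name.
model_id = "inclusionAI/Ring-Lite-Distill"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # place MoE weights across available devices
    trust_remote_code=True,  # custom MoE architectures often ship their own code
)

prompt = "Solve step by step: what is 17 * 23?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```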
Problem

Research questions and friction points this paper is trying to address.

Develop compact MoE model with comprehensive reasoning capabilities
Achieve efficient lightweight reasoning with minimal activated parameters
Balance advanced and general capabilities in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight MoE model with 2.75B activated parameters
High-quality data curation and training paradigms
Comprehensive reasoning and general capability coverage
Ling Team
AI@Ant Group
Chilin Fu
Chunwei Wu
Jianwen Wang
Jingyu Hu
Liang Jiang
Meng Li
Peng Jiao
Pingping Liu
Shaomian Zheng
Shiwei Liang
Shuaicheng Li
Yalin Zhang
Yingting Wu
Yongkang Liu
Zhenyu Huang