🤖 AI Summary
To address instability in reinforcement learning (RL) training of Mixture-of-Experts (MoE) large language models, conflicts across multi-domain training data, and low inference efficiency, this paper introduces Ring-lite, a lightweight MoE reasoning model built on Ling-lite. Methodologically, the authors propose the C3PO algorithm to stabilize MoE-PPO training; replace conventional validation metrics with entropy loss as the criterion for selecting knowledge-distillation checkpoints; and design a two-stage curriculum training paradigm to mitigate cross-domain interference. On benchmarks including AIME, LiveCodeBench, and GPQA-Diamond, Ring-lite performs on par with state-of-the-art small reasoning models while activating only 2.75B of its 16.8B total parameters (roughly one-third of the parameters required by comparable SOTA models) and delivering significantly higher throughput. The model, training data, and code are fully open-sourced.
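Of these three ideas, the checkpoint-selection criterion is concrete enough to sketch. Below is a minimal Python illustration of swapping the conventional best-validation-score rule for an entropy-based one; the `Checkpoint` fields, the target-entropy decision rule, and all names are hypothetical assumptions for illustration, since the summary only names the criterion, not the implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    path: str
    val_score: float      # conventional criterion, e.g. held-out benchmark accuracy
    entropy_loss: float   # mean per-token entropy of the policy on probe prompts

def pick_by_validation(ckpts: list[Checkpoint]) -> Checkpoint:
    # Conventional rule: take the checkpoint with the best validation score.
    return max(ckpts, key=lambda c: c.val_score)

def pick_by_entropy(ckpts: list[Checkpoint], target: float = 0.5) -> Checkpoint:
    # Criterion named by the paper: entropy loss. The concrete decision rule
    # used here (closest to a target entropy, so the distilled policy is
    # neither collapsed nor diffuse before RL) is an assumption, not the
    # paper's published rule.
    return min(ckpts, key=lambda c: abs(c.entropy_loss - target))
```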
📝 Abstract
We present Ring-lite, a Mixture-of-Experts (MoE) large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8-billion-parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline that integrates distillation with RL, revealing previously undocumented challenges in MoE RL training. First, we identify optimization instability during RL training and propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput through algorithm-system co-design. Second, we empirically demonstrate that selecting distillation checkpoints for RL based on entropy loss, rather than validation metrics, yields a superior performance-efficiency trade-off in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing the domain conflicts that arise when training on mixed-domain datasets. We will release the model, dataset, and code.
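The abstract names C3PO but does not describe its mechanism. Purely as an illustration of the kind of algorithm-system co-design the name suggests, the sketch below packs each PPO optimization step to a fixed token budget so that per-step computation does not swing with rollout length; the budget value, the packing rule, and every identifier are assumptions, not the published algorithm.

```python
import random

TOKEN_BUDGET = 8192  # hypothetical fixed per-step token budget

def build_fixed_budget_batch(rollouts: list[dict], budget: int = TOKEN_BUDGET):
    """Greedily pack whole rollouts into a batch until the token budget is hit.

    Holding the total token count (the unit PPO actually optimizes over)
    constant across steps is one plausible way to stabilize training and keep
    hardware utilization steady when response lengths vary widely -- an
    illustrative reading of "constrained contextual computation", not the
    paper's algorithm.
    """
    random.shuffle(rollouts)
    batch, used = [], 0
    for r in rollouts:
        n = len(r["token_ids"])
        if used + n > budget:
            continue  # skip rollouts that would overflow the budget
        batch.append(r)
        used += n
    return batch, used
```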