Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address instability in reinforcement learning (RL) training of Mixture-of-Experts (MoE) large language models, conflicts among multi-domain training data, and low inference efficiency, this paper introduces Ring-lite, a lightweight MoE reasoning model built upon Ling-lite. Methodologically, the authors propose the C3PO algorithm to stabilize PPO-based RL training of MoE models, replace conventional validation metrics with entropy loss as the criterion for selecting knowledge distillation checkpoints, and design a two-stage curriculum training paradigm to mitigate cross-domain interference. On benchmarks including AIME, LiveCodeBench, and GPQA-Diamond, Ring-lite matches state-of-the-art small reasoning models while activating only 2.75B of its 16.8B total parameters (roughly one-third of the activated parameters of comparable SOTA models) and delivering significantly higher throughput. The model, training data, and code are fully open-sourced.
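The summary singles out entropy loss, rather than validation metrics, as the criterion for choosing which distillation checkpoint to carry into RL. Below is a minimal sketch of that selection loop, assuming a Hugging-Face-style causal LM and a small probe set of prompts; the preference for the lowest mean token entropy, and the `load_fn` and `probe_batches` helpers, are illustrative assumptions rather than the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(model, probe_batches):
    """Average per-token entropy of the policy's next-token distribution
    over a probe set (assumes an HF-style model returning .logits)."""
    total_entropy, total_tokens = 0.0, 0
    for input_ids, attention_mask in probe_batches:
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        log_probs = F.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq]
        mask = attention_mask.bool()
        total_entropy += entropy[mask].sum().item()
        total_tokens += int(mask.sum())
    return total_entropy / total_tokens

def select_rl_checkpoint(checkpoint_paths, probe_batches, load_fn):
    """Score each distillation checkpoint by mean token entropy and return
    the lowest-entropy one. Whether 'lower is better' matches the paper's
    criterion is an assumption; the source only says entropy loss replaces
    validation metrics as the selection signal."""
    scored = []
    for path in checkpoint_paths:
        model = load_fn(path)  # hypothetical loader, e.g. wrapping from_pretrained
        scored.append((mean_token_entropy(model, probe_batches), path))
    return min(scored)[1]
```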

📝 Abstract
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8-billion-parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training on mixed datasets. We will release the model, dataset, and code.
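The abstract characterizes C3PO only at a high level: an algorithm-system co-design that stabilizes MoE RL training and raises throughput. One plausible ingredient, sketched below purely for illustration, is constraining each optimization step to a fixed token budget, so that per-step compute and gradient scale stay constant even as response lengths fluctuate during RL. The `Rollout` type, the packing helper, and the 262,144-token budget are assumptions, not the published algorithm.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    tokens: list[int]  # generated response token ids
    reward: float      # scalar reward for the response

def pack_fixed_token_budget(rollouts, budget=262_144):
    """Greedily pack whole rollouts until the fixed token budget is filled.

    Fixing the number of tokens (rather than samples) per optimization
    step is one reading of 'constrained contextual computation'; overflow
    rollouts are deferred to the next step instead of being dropped.
    """
    batch, used, deferred = [], 0, []
    for r in rollouts:
        n = len(r.tokens)
        if used + n <= budget:
            batch.append(r)
            used += n
        else:
            deferred.append(r)  # carried over to the next update
    return batch, deferred
```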
Problem

Research questions and friction points this paper is trying to address.

Optimizing large language models for efficient reasoning via reinforcement learning
Addressing instability in mixture-of-experts reinforcement learning training
Improving performance-efficiency trade-offs with novel distillation checkpoint selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts model optimized via reinforcement learning
Constrained Contextual Computation Policy Optimization (C3PO)
Two-stage training for multi-domain data integration (a schedule sketch follows this list)
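As a concrete but purely hypothetical instantiation of the two-stage paradigm, the sketch below trains on a single domain first and then samples from a fixed-ratio domain mixture. Which domain leads and the 50/50 ratio are assumptions; the source states only that training proceeds in two stages to reduce conflicts among mixed-domain data.

```python
import random

def two_stage_schedule(math_data, other_data, steps_stage2, p_math=0.5, seed=0):
    """Yield (stage, example) pairs for a two-stage curriculum.

    Stage 1 uses one domain; stage 2 draws from a fixed mixture. Both the
    leading domain and the mixing ratio are illustrative guesses.
    """
    rng = random.Random(seed)
    for ex in math_data:           # Stage 1: single-domain RL
        yield "stage1", ex
    for _ in range(steps_stage2):  # Stage 2: fixed-ratio domain mixture
        pool = math_data if rng.random() < p_math else other_data
        yield "stage2", rng.choice(pool)
```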
👥 Authors
Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxin Yang, Xinyu Kong, Xue-zhou Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen