🤖 AI Summary
This work addresses a known weakness of Group Relative Policy Optimization (GRPO): on difficult tasks, rewards are sparse, so small language models often fail to improve their reasoning. The authors propose CoDistill-GRPO, which they describe as the first framework for bidirectional collaborative distillation between a large and a small model within GRPO. The small model receives dense reward signals derived from the large model's output distribution, while the large model updates its policy on trajectories generated by the small model, corrected with importance reweighting. Because both models are trained together, no fixed pretrained teacher is required. Experiments on the Qwen and Llama model families show substantial improvements in training efficiency and small-model performance. On the Minerva benchmark, Qwen2.5-Math-1.5B gains more than 11.6 percentage points in accuracy over the base model and a further 6.0 percentage points over standard GRPO, while the large model (Qwen2.5-Math-7B) trains approximately 18% faster and nearly matches standard GRPO performance.
📝 Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but it often fails to improve small models because rewards are sparse on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, both approaches assume that such an oracle already exists, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, which reduces the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small-model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B on the Minerva dataset, we observe an accuracy increase of more than 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding a speedup of approximately 18%, which may be of independent interest.
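The abstract does not state the exact objectives, so the following is only a minimal sketch of the two signals it describes, under stated assumptions. It shows why group-relative advantages vanish when every rollout in a group gets the same sparse reward, how a dense KD term could restore a learning signal for the small model, and how an importance ratio could reweight small-model rollouts for the large model. The function names, the mean log-probability form of the KD reward, the mixing weight `beta`, and the ratio cap `clip` are all hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kd_reward(large_token_logprobs):
    """Hypothetical dense KD reward: mean per-token log-probability that the
    large model assigns to a rollout sampled from the small model."""
    return sum(large_token_logprobs) / len(large_token_logprobs)


def small_model_rewards(task_rewards, large_token_logprobs, beta=0.1):
    """Assumed combination: sparse task reward plus beta times the KD reward."""
    return [r + beta * kd_reward(lp)
            for r, lp in zip(task_rewards, large_token_logprobs)]


def importance_weights(large_seq_logprobs, small_seq_logprobs, clip=2.0):
    """Sequence-level ratio pi_large(y|x) / pi_small(y|x), capped at `clip`
    (the cap is an assumption added here for numerical stability)."""
    return [min(math.exp(lg - sm), clip)
            for lg, sm in zip(large_seq_logprobs, small_seq_logprobs)]


def large_model_surrogate(task_rewards, large_seq_logprobs, small_seq_logprobs):
    """Assumed surrogate for the large model: importance-weighted
    group-relative advantages, averaged over small-model rollouts."""
    adv = group_relative_advantages(task_rewards)
    w = importance_weights(large_seq_logprobs, small_seq_logprobs)
    return mean([wi * ai for wi, ai in zip(w, adv)])


# One group of four small-model rollouts that all fail the task.
sparse = [0.0, 0.0, 0.0, 0.0]
flat_adv = group_relative_advantages(sparse)  # every advantage is zero

# Per-token log-probabilities under the large model (made-up numbers).
large_lp = [[-0.1, -0.2], [-1.0, -2.0], [-0.5, -0.5], [-3.0, -1.0]]
dense_adv = group_relative_advantages(small_model_rewards(sparse, large_lp))
```

With the sparse rewards alone, `flat_adv` carries no gradient signal. Adding the KD term ranks the rollouts by how plausible the large model finds them, so `dense_adv` is non-zero even though no rollout solved the task.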