🤖 AI Summary
This work addresses the limitations of conventional training methods in effectively enhancing modern code generation models. To overcome this challenge, the authors propose MicroCoder-GRPO, an improved group relative policy optimization algorithm that boosts long-sequence generation capability and solution-space diversity by incorporating a conditional truncation mask, a diversity-driven temperature selection mechanism, and the removal of the KL divergence loss combined with high clipping ratios. Evaluated on LiveCodeBench v6 using the newly constructed MicroCoder-Dataset and the accompanying MicroCoder-Evaluator framework, the approach achieves up to a 17.6% relative improvement over strong baselines. Notably, the new dataset yields roughly three times the performance gains of mainstream datasets within 300 training steps, while the evaluation framework delivers approximately 25% higher accuracy and around 40% faster execution.
📝 Abstract
Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective at improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to preserve long-output capability while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of the KL loss combined with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to a 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended-context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% better evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we distill 34 training insights spanning seven main aspects, demonstrating that properly trained models can achieve performance competitive with larger counterparts.
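The abstract does not spell out the modified objective, but the three changes can be illustrated with a minimal sketch of a GRPO-style clipped surrogate. Everything here is assumed for illustration: the function name `grpo_loss`, the asymmetric clip bounds `eps_low`/`eps_high` (a widened upper bound standing in for "high clipping ratios"), and the particular form of the conditional truncation mask (skip length-truncated completions only when they would be reinforced). The KL penalty term of standard GRPO is simply absent, matching the stated removal of the KL loss.

```python
import math

def grpo_loss(logp_new, logp_old, rewards, truncated,
              eps_low=0.2, eps_high=0.28):
    """Illustrative GRPO-style loss over one group of G sampled completions.

    logp_new, logp_old: per-completion summed log-probs (policy vs. rollout)
    rewards:            per-completion scalar rewards
    truncated:          True where the completion hit the length cap
    """
    g = len(rewards)
    # Group-relative advantage: standardize rewards within the group.
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    adv = [(r - mean) / (std + 1e-8) for r in rewards]

    total, kept = 0.0, 0
    for ln, lo, a, t in zip(logp_new, logp_old, adv, truncated):
        # Conditional truncation mask (assumed form): drop truncated
        # completions only when they carry positive advantage, so
        # incomplete long outputs are not reinforced.
        if t and a > 0:
            continue
        ratio = math.exp(ln - lo)
        # Asymmetric clipping: the upper bound is widened relative to
        # the lower one, keeping low-probability (diverse) tokens in play.
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # PPO-style pessimistic surrogate; no KL penalty term follows it.
        total += min(ratio * a, clipped * a)
        kept += 1
    return -total / max(kept, 1)
```

With the mask active, a truncated completion that happened to score well contributes nothing to the gradient, whereas without it the same sample would pull probability mass toward outputs that never finished.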