🤖 AI Summary
This work addresses a limitation of existing reinforcement learning approaches for code large language models: they rely on unit tests, which are often scarce and unreliable, and this hinders scalable training and inference. To overcome this, the authors propose CodeScaler, a test-case-free reward model for code generation that combines verified preference data, syntax-aware code extraction, and validity-preserving reward shaping. Evaluated across four code benchmarks, CodeScaler outperforms execution-feedback-based RL by up to 4.23 points. When scaled to 44K problems without any test cases, it achieves a 14.64-point improvement. At inference time, it reduces latency by an order of magnitude relative to unit-test-based approaches. On RM-Bench, it surpasses prior reward models by 3.3 points on coding tasks and by an average of 2.7 points across general and reasoning tasks.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test-based approaches while reducing latency 10-fold. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points) but also in the general and reasoning domains (+2.7 points on average).
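The abstract names three mechanisms (syntax-aware code extraction, validity-preserving reward shaping, and reward-model-based test-time scaling) without giving their details. The Python sketch below shows one plausible reading of how they could fit together; it is not the authors' implementation. The function names, the fixed `invalid_reward` penalty, and the `score_fn` callable standing in for a reward model are all hypothetical and introduced only for illustration.

```python
import ast
import re
from typing import Callable, Optional, Sequence

# Build the fence marker programmatically so this snippet can itself sit in a fenced block.
FENCE_MARK = "`" * 3
FENCE = re.compile(FENCE_MARK + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last non-empty fenced block that parses as Python, else None."""
    for block in reversed(FENCE.findall(response)):
        if not block.strip():
            continue
        try:
            ast.parse(block)  # syntax check only; the code is never executed
        except (SyntaxError, ValueError):
            continue
        return block
    return None


def shaped_reward(
    response: str,
    score_fn: Callable[[str], float],  # hypothetical stand-in for a reward model
    invalid_reward: float = -1.0,      # assumed penalty, not taken from the paper
) -> float:
    """Score extracted code; give a fixed low reward when no valid code is found."""
    code = extract_code(response)
    if code is None:
        return invalid_reward
    return score_fn(code)


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Test-time scaling by reranking: keep the candidate with the highest shaped reward."""
    return max(candidates, key=lambda r: shaped_reward(r, score_fn))
```

Because `extract_code` only parses and never runs the candidate, selection needs no test cases or sandbox, which is the property the abstract credits for the latency reduction. In practice `score_fn` would wrap a forward pass of a trained reward model rather than a hand-written heuristic.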