CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

📅 2026-02-04
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing reinforcement learning approaches for code large language models: they rely on unit tests, which are often scarce and unreliable, and this hinders scalable training and inference. To overcome this, the authors propose CodeScaler, a reward model for code generation that requires no test cases. It combines verified preference data, syntax-aware code extraction, and validity-preserving reward shaping to enable efficient scaling. Evaluated across four code benchmarks, CodeScaler outperforms execution-feedback-based RL by up to 4.23 points. When training is scaled to 44K problems without any test cases, it achieves a 14.64-point improvement. At inference time, it reduces latency by an order of magnitude relative to unit-test-based approaches. On RM-Bench, it surpasses existing reward models by 3.3 points on coding tasks and by an average of 2.7 points across general and reasoning tasks.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
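The abstract's pipeline (extract code with syntax awareness, score it without execution, pick the best candidate at test time) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Python's `ast` module for the validity check, the fixed `invalid_reward` floor, and the `score_fn` callable standing in for the reward model are all assumptions made for illustration.

```python
import ast
import re
from typing import Callable, List, Optional

# Build the fence marker programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_valid_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None if none does."""
    for block in reversed(CODE_BLOCK.findall(response)):
        try:
            ast.parse(block)  # syntax check only; nothing is executed
            return block
        except (SyntaxError, ValueError):
            continue
    return None


def shaped_reward(
    response: str,
    score_fn: Callable[[str], float],
    invalid_reward: float = -1.0,
) -> float:
    """Score extracted code with score_fn; give unparseable responses a fixed floor.

    Assumption: the floor is chosen below any score score_fn can return, so an
    invalid response never outranks a valid one.
    """
    code = extract_valid_code(response)
    if code is None:
        return invalid_reward
    return score_fn(code)


def best_of_n(responses: List[str], score_fn: Callable[[str], float]) -> str:
    """Test-time scaling by reranking: return the response with the highest shaped reward."""
    return max(responses, key=lambda r: shaped_reward(r, score_fn))
```

In practice `score_fn` would wrap a trained reward model; here any function mapping a code string to a float works, which keeps the sketch runnable without model weights.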
Problem

Research questions and friction points this paper is trying to address.

code generation
reinforcement learning
reward models
test cases
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward Model
Code LLM
Test-Time Scaling
Reinforcement Learning
Preference Learning
🔎 Similar Papers