🤖 AI Summary
This work proposes an efficient approach to scaling the inference token budget of large language models for competitive programming. During reinforcement learning training, a verification RL warmup and randomized clipping shift the trajectory of validation accuracy against generated reasoning tokens; at test time, an end-to-end trained multi-round parallel chain-of-thought pipeline distributes the token budget across concurrent generation, verification, and refinement threads. Evaluated on 456 hard problems from AetherCode, the framework's pass@1, using an average of 7.6 million tokens per problem, matches the oracle pass@16 of the underlying RL model and surpasses GPT-5-high, yielding a code reasoning framework whose training objective is aligned with its test-time inference structure.
📝 Abstract
We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
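The multi-round pipeline in the abstract can be sketched as a simple budget-allocation loop. This is a hypothetical illustration, not the paper's implementation: `generate`, `verify`, and `refine` are stand-ins for model calls, and the token accounting is a flat per-call cost rather than real usage.

```python
import random

def generate(problem, rng):
    # Stand-in for one reasoning/code-generation pass of the model.
    return f"candidate-{rng.random():.3f}"

def verify(problem, code, rng):
    # Stand-in for a verification pass (e.g. self-checking the candidate);
    # here it randomly "passes" ~30% of the time.
    return rng.random() > 0.7

def refine(problem, code, rng):
    # Stand-in for a refinement pass that revises a failing candidate.
    return code + "+fix"

def parallel_thinking(problem, n_threads=16, n_rounds=16,
                      tokens_per_call=1000, seed=0):
    """Distribute a token budget across threads, each running rounds of
    generation, verification, and refinement; return the first candidate
    that verifies, along with the total tokens spent."""
    rng = random.Random(seed)
    spent = 0
    for _ in range(n_threads):           # threads explore independently
        code = generate(problem, rng)
        spent += tokens_per_call
        for _ in range(n_rounds):        # each thread verifies, then refines
            ok = verify(problem, code, rng)
            spent += tokens_per_call
            if ok:
                return code, spent       # first verified solution wins
            code = refine(problem, code, rng)
            spent += tokens_per_call
    return None, spent                   # budget exhausted without success

solution, spent = parallel_thinking("two-sum")
print(solution is not None, spent)
```

In the paper's actual system the model is trained end-to-end on this pipeline, so the generation, verification, and refinement roles share one policy rather than being independent stubs.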