Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient approach to extending the inference token budget of large language models for competitive programming. During reinforcement learning (RL) training, a verification RL warmup and randomized clipping shift the roughly log-linear trajectory of validation accuracy against the number of generated reasoning tokens. At test time, an end-to-end trained multi-round parallel thinking pipeline distributes the token budget across concurrent threads and rounds of generation, verification, and refinement. Evaluated on 456 hard problems from AetherCode, the framework at pass@1, using an average of 7.6 million tokens per problem, matches the underlying RL model's oracle pass@16 and surpasses GPT-5-high, yielding a code reasoning framework whose training objective is aligned with its inference-time structure.
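The summary compares pass@1 against oracle pass@16. The card does not define pass@k, but the standard unbiased estimator widely used for code benchmarks (probability that at least one of k samples drawn from n generations with c correct passes) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k
    samples, drawn without replacement from n generations of which
    c are correct, solves the problem."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 16 generations of which 4 are correct, pass@1 is 0.25, and pass@16 is 1.0 whenever at least one generation is correct, which is why pass@16 is described as an oracle bound.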
📝 Abstract
We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
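The thread-and-round structure described in the abstract (16 threads, up to 16 rounds per thread, each round a generation, verification, and refinement step) can be sketched in Python. The generate/verify/refine functions below are toy placeholders, not the paper's actual models or prompts; only the control flow mirrors the described pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the model's three roles (assumptions for
# illustration only): propose a candidate, check it, and carry
# feedback forward to the next round.
def generate(problem, feedback):
    # propose the next integer after the last rejected candidate
    return (feedback or 0) + 1

def verify(problem, candidate):
    # toy verifier: is candidate the integer square root of problem?
    return candidate * candidate == problem

def refine(problem, candidate, feedback):
    # toy refinement: remember the last failed candidate
    return candidate

def run_thread(problem, rounds):
    """One thread: up to `rounds` rounds of generate -> verify -> refine."""
    feedback = None
    for _ in range(rounds):
        candidate = generate(problem, feedback)
        if verify(problem, candidate):
            return candidate
        feedback = refine(problem, candidate, feedback)
    return None  # budget for this thread exhausted

def parallel_thinking(problem, threads=16, rounds=16):
    """Distribute the token budget across independent threads,
    each running multiple rounds; return the first verified answer."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(run_thread, problem, rounds)
                   for _ in range(threads)]
        for future in futures:
            answer = future.result()
            if answer is not None:
                return answer
    return None
```

In the paper's setting each role would be a call to the RL-trained model rather than these stubs, and training the model end-to-end on this loop is what aligns the training objective with the test-time structure.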
Problem

Research questions and friction points this paper is trying to address.

reasoning tokens
competitive programming
scaling
reinforcement learning
parallel thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
parallel thinking
reasoning tokens
competitive programming
token budget scaling