🤖 AI Summary
Chain-of-thought reasoning is constrained by model capability limits, and its sequential nature hinders test-time scalability. This work proposes the first end-to-end reinforcement learning framework that enables large language models to acquire divide-and-conquer reasoning capabilities—dynamically decomposing problems, solving subproblems sequentially, and conditionally integrating results. By incorporating the complete divide-and-conquer process into reinforcement learning training, this approach resolves the fundamental mismatch between generic post-training and structured reasoning. Evaluated on competition-level benchmarks, the method achieves an 8.6% improvement in Pass@1 and a 6.3% gain in Pass@32, significantly outperforming existing chain-of-thought approaches.
📝 Abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution space. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
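The DAC step described above (decompose, solve subproblems sequentially, then answer the original problem conditioned on the subproblem solutions) can be sketched as a simple control loop. This is a toy illustration only: the `ToyPolicy` class and its `decompose`/`solve`/`integrate` methods are hypothetical stand-ins for LLM policy calls, not the paper's actual interface.

```python
# Toy sketch of one divide-and-conquer (DAC) reasoning step.
# ToyPolicy is a hypothetical stand-in for an LLM policy; here the
# "problem" is just a list of numbers whose sum we want.

class ToyPolicy:
    def decompose(self, problem):
        # Split the problem into a group of subproblems (here, two halves).
        mid = len(problem) // 2
        return [problem[:mid], problem[mid:]]

    def solve(self, subproblem, context):
        # Solve one subproblem; a real policy would also condition on
        # the previously produced solutions passed in `context`.
        return sum(subproblem)

    def integrate(self, problem, subproblems, solutions):
        # Address the original problem conditioned on subproblem solutions.
        return sum(solutions)

def dac_step(problem, policy):
    subproblems = policy.decompose(problem)
    solutions = []
    for sp in subproblems:  # sequential solving, as in the abstract
        solutions.append(policy.solve(sp, context=list(solutions)))
    return policy.integrate(problem, subproblems, solutions)

print(dac_step([1, 2, 3, 4, 5], ToyPolicy()))  # → 15
```

In the proposed framework, both the decomposition and the subsequent solving are produced by the same policy and included in RL training, rather than the decomposition being fixed by hand as in this sketch.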