🤖 AI Summary
Problem: Current large language models (LLMs) struggle to leverage execution feedback for multi-step iterative code refinement; their independent-sampling paradigm suffers from low sample efficiency and weak error correction. Method: The paper introduces an end-to-end reinforcement learning framework for code synthesis that conditions generation on automated execution feedback and optimizes with Proximal Policy Optimization (PPO), training models to ground their outputs in that feedback and refine code over multiple steps at inference time. Results: On competitive programming tasks, both 8B- and 70B-parameter models set new state-of-the-art results, achieving substantial accuracy gains while cutting the number of required samples by an order of magnitude, and feedback utilization improves markedly over single-step generation.
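The summary names PPO as the optimizer. For reference (standard PPO, not a detail specific to this paper), the clipped surrogate objective that such training typically maximizes is:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\big(
      r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
    \big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) the clipping range; in the code-synthesis setting described, actions are generated tokens and the reward derives from executing the produced code against tests.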
📝 Abstract
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
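The multi-step loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `refine`, `run_tests`, and the `toy_model` stand-in for an LLM are hypothetical names, and real systems would sandbox execution rather than calling `exec` directly.

```python
from typing import Callable, List, Tuple

def run_tests(code: str, tests: List[Tuple[int, int]]) -> List[str]:
    """Execute a candidate solution against public tests; return failure messages."""
    ns: dict = {}
    try:
        exec(code, ns)  # illustration only; a real system isolates execution
    except Exception as e:
        return [f"error while loading code: {e}"]
    failures = []
    for x, expected in tests:
        try:
            got = ns["solve"](x)
            if got != expected:
                failures.append(f"solve({x}) = {got}, expected {expected}")
        except Exception as e:
            failures.append(f"solve({x}) raised {e}")
    return failures

def refine(model: Callable[[str], str], task: str,
           tests: List[Tuple[int, int]], max_turns: int = 3) -> str:
    """Multi-turn refinement: each turn conditions the model on execution feedback."""
    prompt, code = task, ""
    for _ in range(max_turns):
        code = model(prompt)
        failures = run_tests(code, tests)
        if not failures:
            return code  # all public tests pass; stop early
        # Ground the next generation in the observed failures.
        prompt = f"{task}\nPrevious attempt:\n{code}\nFailures:\n" + "\n".join(failures)
    return code

# Toy stand-in for an LLM: produces an off-by-one bug, then fixes it
# once the prompt contains failure feedback.
def toy_model(prompt: str) -> str:
    if "expected" in prompt:
        return "def solve(x):\n    return x * 2\n"
    return "def solve(x):\n    return x * 2 + 1\n"

final = refine(toy_model, "Double the input.", [(1, 2), (3, 6)])
```

The RL method the abstract proposes would train the model so that this feedback-conditioned correction behavior emerges reliably, rather than relying on many independent samples.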