🤖 AI Summary
Problem: Current large language models (LLMs) struggle to leverage execution feedback for multi-step iterative code refinement; their independent-sampling paradigm suffers from low sample efficiency and weak error correction. Method: The paper introduces an end-to-end reinforcement learning framework for code synthesis that conditions generation on automated execution feedback and optimizes with Proximal Policy Optimization (PPO), training models to ground their outputs in that feedback and refine code over multiple steps at inference time. Results: On competitive programming tasks, both 8B- and 70B-parameter models set new state-of-the-art results, achieving substantial accuracy gains while cutting the number of required samples by an order of magnitude, and feedback utilization improves markedly over single-step generation.
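The summary names PPO as the optimizer. For reference (standard PPO, not a detail specific to this paper), the clipped surrogate objective that such training typically maximizes is:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\big(
      r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
    \big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) the clipping range; in the code-synthesis setting described, actions are generated tokens and the reward derives from executing the produced code against tests.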
📝 Abstract
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
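The multi-step loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `refine`, `run_tests`, and the `toy_model` stand-in for an LLM are hypothetical names, and real systems would sandbox execution rather than calling `exec` directly.

```python
from typing import Callable, List, Tuple

def run_tests(code: str, tests: List[Tuple[int, int]]) -> List[str]:
    """Execute a candidate solution against public tests; return failure messages."""
    ns: dict = {}
    try:
        exec(code, ns)  # illustration only; a real system isolates execution
    except Exception as e:
        return [f"error while loading code: {e}"]
    failures = []
    for x, expected in tests:
        try:
            got = ns["solve"](x)
            if got != expected:
                failures.append(f"solve({x}) = {got}, expected {expected}")
        except Exception as e:
            failures.append(f"solve({x}) raised {e}")
    return failures

def refine(model: Callable[[str], str], task: str,
           tests: List[Tuple[int, int]], max_turns: int = 3) -> str:
    """Multi-turn refinement: each turn conditions the model on execution feedback."""
    prompt, code = task, ""
    for _ in range(max_turns):
        code = model(prompt)
        failures = run_tests(code, tests)
        if not failures:
            return code  # all public tests pass; stop early
        # Ground the next generation in the observed failures.
        prompt = f"{task}\nPrevious attempt:\n{code}\nFailures:\n" + "\n".join(failures)
    return code

# Toy stand-in for an LLM: produces an off-by-one bug, then fixes it
# once the prompt contains failure feedback.
def toy_model(prompt: str) -> str:
    if "expected" in prompt:
        return "def solve(x):\n    return x * 2\n"
    return "def solve(x):\n    return x * 2 + 1\n"

final = refine(toy_model, "Double the input.", [(1, 2), (3, 6)])
```

The RL method the abstract proposes would train the model so that this feedback-conditioned correction behavior emerges reliably, rather than relying on many independent samples.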