🤖 AI Summary
This paper addresses code generation under multi-turn execution feedback. Existing approaches either ignore feedback or rely on complex hierarchical reinforcement learning (HRL) to optimize multi-turn rewards. The authors show that code generation with execution feedback can be cast as a one-step recoverable Markov decision process (MDP), in which the correct solution is reachable from any intermediate code state in a single turn; this eliminates the need for HRL and enables efficient optimization using only single-step rewards. The resulting μCode framework iteratively trains a generator that produces code conditioned on multi-turn execution feedback, together with a verifier that scores the newly generated code. Experiments across multiple benchmarks show substantial improvements over state-of-the-art baselines, supporting both the sufficiency of single-step rewards and the value of leveraging execution feedback in code generation.
📝 Abstract
We address the problem of code generation with multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, μCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. μCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We analyze the design choices of the reward models and policy, and show the efficacy of μCode at utilizing execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
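To make the abstract's loop concrete, here is a minimal sketch of a multi-turn generate-verify-execute loop driven purely by single-step selection. This is an illustrative reading of the described setup, not the paper's actual implementation: `generate_candidates`, `verifier_score`, and `run_tests` are hypothetical stand-ins for the trained generator, the learned verifier, and the execution environment.

```python
from typing import Callable, List, Tuple

def best_of_n_repair(
    prompt: str,
    generate_candidates: Callable[[str, str], List[str]],  # (prompt, feedback) -> candidate programs
    verifier_score: Callable[[str, str], float],           # (prompt, program) -> scalar score
    run_tests: Callable[[str], Tuple[bool, str]],          # program -> (passed, execution feedback)
    max_turns: int = 3,
) -> str:
    """Each turn: sample candidates conditioned on the latest execution
    feedback, keep the verifier's top-scored candidate (a single-step
    decision), execute it, and stop once the tests pass."""
    feedback = ""
    best = ""
    for _ in range(max_turns):
        candidates = generate_candidates(prompt, feedback)
        best = max(candidates, key=lambda c: verifier_score(prompt, c))
        passed, feedback = run_tests(best)
        if passed:
            break
    return best
```

Because the MDP is one-step recoverable, each turn's selection needs only a single-step reward (the verifier's score), so no multi-turn credit assignment is required to drive the loop.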