Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLMs frequently generate functionally correct but runtime-inefficient code, hindering practical deployment. To address this, the authors introduce Afterburner, a closed-loop test-time optimization framework in which an execution sandbox measures the empirical performance of generated code and feeds that signal back to the model for iterative refinement. Three training strategies are compared: supervised fine-tuning (SFT), direct preference optimization (DPO), and Group Relative Policy Optimization (GRPO). SFT and DPO saturate quickly, whereas GRPO, driven by lightweight execution-based rewards rather than human annotations, continues to improve: on the Venus and APPS benchmarks it raises pass@1 from 47% to 62% (15 percentage points) and lifts the share of solutions that outperform human submissions in efficiency from 31% to 45% (14 points), demonstrating the effectiveness and practicality of test-time RL for code efficiency optimization.

📝 Abstract
Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.
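The closed-loop refinement described in the abstract can be sketched roughly as below. All names here (`closed_loop_optimize`, the toy generator and sandbox) are illustrative assumptions, not the paper's actual API or implementation:

```python
"""Minimal sketch of closed-loop test-time code refinement: the model
proposes code, a sandbox measures it, and the measurement is fed back
into the next generation round. Illustrative only."""


def closed_loop_optimize(problem, generate_code, run_in_sandbox, max_iters=5):
    """Iteratively refine code using empirical feedback from an execution sandbox."""
    feedback = None
    best_code, best_runtime = None, float("inf")
    for _ in range(max_iters):
        code = generate_code(problem, feedback)   # model proposes a solution
        result = run_in_sandbox(code)             # empirical performance signal
        if result["passed"] and result["runtime"] < best_runtime:
            best_code, best_runtime = code, result["runtime"]
        feedback = result                         # close the loop
    return best_code, best_runtime


# Toy stand-ins for demonstration: each round the "model" halves the
# simulated runtime of its previous attempt.
def toy_generate(problem, feedback):
    prev = feedback["runtime"] if feedback else 8.0
    return f"solution(runtime={prev / 2})"


def toy_sandbox(code):
    runtime = float(code.split("=")[1].rstrip(")"))
    return {"passed": True, "runtime": runtime}
```

The loop itself is model-agnostic; what the paper varies is how the underlying model is trained (SFT, DPO, or GRPO) to exploit this feedback.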
Problem

Research questions and friction points this paper is trying to address.

LLMs generate functionally correct but runtime-inefficient code, a bottleneck for real-world deployment
One-pass generation provides no mechanism to refine code against empirical performance feedback
SFT and DPO rapidly saturate, limiting sustained efficiency gains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop test-time system that iteratively refines code using execution-sandbox feedback
Systematic comparison of SFT, DPO, and GRPO as training strategies for efficiency optimization
GRPO enables continuous self-improvement, raising pass@1 from 47% to 62% and efficiency wins over human submissions from 31% to 45%
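The mechanism behind GRPO's continuous improvement is its group-relative reward: a batch of sampled solutions is scored (here, by an execution-based reward such as speed), and each sample's advantage is its reward normalized against the group's own mean and spread, so faster-than-average code is reinforced. A hedged sketch of that normalization, not the paper's implementation:

```python
"""Sketch of the group-relative advantage used in GRPO-style training:
rewards within a sampled group are standardized against the group
baseline, so no external value model or human label is needed."""

import statistics


def group_relative_advantages(rewards):
    """Return each reward normalized by the group mean and std deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]
```

Solutions scoring above the group average get positive advantages and are reinforced; below-average ones are penalized, which is what lets execution signals alone drive efficiency self-improvement.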