🤖 AI Summary
Large language models (LLMs) frequently generate code with suboptimal runtime efficiency, limiting their deployment in performance-critical applications. To address this, we propose an efficiency-aware reinforcement learning framework that jointly optimizes code correctness and execution efficiency via a dynamic exploration mechanism, error-insensitive reward modeling, and a two-stage fine-tuning strategy. Our method integrates offline pretraining, online fine-tuning, and performance-driven fine-grained reward modeling, thereby overcoming reliance on static datasets. Experiments on a 7B-parameter LLM demonstrate a 10.18% absolute improvement in functional correctness and a 7.75% reduction in average execution time, achieving performance competitive with significantly larger models. To our knowledge, this is the first work to empirically verify simultaneous, measurable gains in both correctness and efficiency for LLM-generated code.
📝 Abstract
While code large language models have demonstrated remarkable progress in code generation, the generated code often exhibits poor runtime efficiency, limiting its practical application in performance-sensitive scenarios. To address this limitation, we propose an efficiency-oriented reinforcement learning framework guided by a novel performance reward. Based on this framework, we take a deeper dive into the code efficiency problem, identifying key bottlenecks and proposing methods to overcome them: (1) Dynamic exploration overcomes the static data constraints of offline fine-tuning, enabling the discovery of more efficient code implementations. (2) The error-insensitive reinforcement learning method and high-contrast efficiency signals are crucial for mitigating systematic errors and achieving effective optimization. (3) Online exploration is most effective when starting from a high-correctness baseline, as this allows for efficiency improvements without sacrificing accuracy. Building on these findings, we propose a two-stage tuning method that achieves high and balanced performance across correctness and efficiency. Experimental results demonstrate the effectiveness of the method: on a 7B model, it improves code correctness by 10.18% and runtime efficiency by 7.75%, achieving performance comparable to much larger models.
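The abstract does not spell out the performance reward, but the core idea (a correctness gate combined with a bounded, high-contrast efficiency signal) can be sketched as follows. This is an illustrative assumption, not the paper's actual formula: the function name, the speedup-against-baseline signal, the clipping bound, and the `alpha` weighting are all hypothetical choices made here for clarity.

```python
def performance_reward(passed: bool, runtime: float,
                       baseline_runtime: float, alpha: float = 0.5) -> float:
    """Hedged sketch of an efficiency-aware reward for RL fine-tuning.

    passed: whether the generated code passed all functional tests.
    runtime / baseline_runtime: measured execution times (seconds).
    alpha: weight of the efficiency term relative to plain correctness.
    """
    # Correctness gate: incorrect code earns no reward at all, so the
    # policy cannot trade accuracy away for speed.
    if not passed:
        return 0.0
    # High-contrast efficiency signal: relative speedup over a reference
    # solution, clipped to [0, 2] and normalized to [0, 1] so that small
    # timing noise does not dominate the reward.
    speedup = baseline_runtime / max(runtime, 1e-9)
    efficiency = max(0.0, min(speedup, 2.0)) / 2.0
    # Blend a flat correctness reward with the efficiency bonus.
    return (1.0 - alpha) + alpha * efficiency
```

Under this sketch, a correct solution matching the baseline speed scores 0.75, a correct solution twice as fast scores 1.0, and any failing solution scores 0.0, giving the optimizer a clear, bounded gradient toward faster correct code.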