🤖 AI Summary
This work addresses the prevalent overemphasis on functional correctness—while neglecting runtime efficiency—in large language model (LLM)-based code generation. We propose the first dual-objective fine-tuning paradigm jointly optimizing for both execution efficiency and functional correctness. Methodologically, we construct a high-quality, two-dimensional fine-tuning dataset via multi-model collaborative sampling and rigorous local sandbox evaluation (measuring execution time and memory footprint) to identify optimal solutions; we further introduce efficiency-driven data cleaning and supervised fine-tuning (SFT). Our key contributions are: (i) the first incorporation of empirically measured runtime efficiency as a core optimization objective in LLM code-generation fine-tuning, and (ii) an automated, efficiency-aware solution selection mechanism. Evaluated on Qwen2.5-Coder-7B-Instruct, our approach achieves a 12.9-percentage-point improvement in pass@1 (reaching 57.7%) and reduces average execution time per correct task by 48.4%, substantially enhancing the practical utility and deployability of generated code.
📝 Abstract
As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods focus primarily on correctness and often overlook efficiency. To address this gap, we introduce Effi-Code, an approach that improves both aspects by fine-tuning LLMs on a high-quality dataset of correct and efficient code samples. Our methodology leverages multiple LLMs to generate diverse candidate code solutions for various tasks across different programming languages. We then evaluate these solutions by directly measuring their execution time and memory usage through local execution. For each task, the code solution with the lowest execution time and memory consumption is selected as the final output. Experimental results demonstrate significant improvements when fine-tuning with Effi-Code. For instance, Qwen2.5-Coder-7B-Instruct's pass@1 score increases from 44.8% to 57.7%, while the average execution time for correct tasks decreases by 48.4%. Effi-Code offers a scalable and effective solution for advancing AI-driven code generation, benefiting both software development and computational problem-solving. The source code of Effi-Code is available at https://github.com/huangd1999/Effi-Code.
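The selection step described in the abstract (run each candidate locally, record time and memory, keep the cheapest correct one) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The names `Measurement`, `measure`, and `select_most_efficient` are hypothetical, candidates are plain in-process Python callables rather than sandboxed programs, and ranking by time first with memory as a tie-breaker is an assumption, since the abstract does not say how the two metrics are combined.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Measurement:
    name: str         # label of the candidate solution (hypothetical field)
    correct: bool     # True if every test case passed
    seconds: float    # best wall-clock time over the repeated runs
    peak_bytes: int   # highest peak allocation seen by tracemalloc


def measure(name: str,
            fn: Callable,
            tests: Sequence[Tuple[tuple, object]],
            repeats: int = 3) -> Measurement:
    """Run one candidate on all tests and record its time and peak memory.

    Note: tracemalloc adds overhead, so the timings are only comparable
    between candidates measured the same way.
    """
    correct = True
    best = float("inf")
    peak = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            for args, expected in tests:
                if fn(*args) != expected:
                    correct = False
        except Exception:
            correct = False  # a crashing candidate counts as incorrect
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return Measurement(name, correct, best, peak)


def select_most_efficient(
        measurements: List[Measurement]) -> Optional[Measurement]:
    """Keep only correct candidates, then pick the cheapest one.

    Assumed ordering: lowest time first, lowest peak memory as tie-breaker.
    """
    passing = [m for m in measurements if m.correct]
    if not passing:
        return None
    return min(passing, key=lambda m: (m.seconds, m.peak_bytes))


if __name__ == "__main__":
    # Two hypothetical candidates for "sum of 1..n".
    def loop_sum(n):
        return sum(list(range(1, n + 1)))

    def formula_sum(n):
        return n * (n + 1) // 2

    tests = [((10,), 55), ((100_000,), 5_000_050_000)]
    results = [measure("loop", loop_sum, tests),
               measure("formula", formula_sum, tests)]
    print(select_most_efficient(results))
```

Filtering for correctness before ranking keeps a fast but wrong candidate from ever being selected, which matches the abstract's aim of a dataset that is both correct and efficient. A real pipeline would run untrusted generated code in an isolated process with time and memory limits rather than calling it directly.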