🤖 AI Summary
Large language models (LLMs) still fail on many code-generation tasks, and the execution-based feedback typically used to repair those failures is hard to interpret. Method: This paper identifies a strong negative correlation between code complexity (measured via cyclomatic complexity and Halstead metrics, among others) and LLM code correctness (Pass@1), and proposes a complexity-driven iterative feedback mechanism. It combines static complexity analysis, logistic regression-based feature selection, multi-round complexity-guided prompting, and the Reflexion agent framework to enable targeted repair of failed code generations. Contribution/Results: On HumanEval, the approach improves GPT-3.5 Turbo's Pass@1 by 35.71%; on BigCodeBench, combined with Reflexion, it boosts Pass@1 by 20.0% for GPT-4o and 23.07% for GPT-o3 mini. This work pioneers the integration of quantifiable code complexity analysis into the LLM code-generation feedback loop, substantially improving both the generation success rate and the interpretability of the feedback.
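For context, here is a minimal sketch of the correlation-analysis step described above: extracting complexity metrics from generated solutions and fitting a logistic regression to predict pass/fail. It assumes Python solutions, the `radon` library (>= 4, where `h_visit(...).total` holds the aggregate Halstead report) and scikit-learn; the exact feature set and model configuration are illustrative, not the paper's pipeline.

```python
# Sketch: relate static complexity metrics of generated code to pass/fail labels.
# Assumes radon >= 4 and scikit-learn; the chosen features and model settings are
# illustrative, not the paper's exact feature-selection setup.
import numpy as np
from radon.complexity import cc_visit
from radon.metrics import h_visit
from sklearn.linear_model import LogisticRegression

def complexity_features(source: str) -> list[float]:
    """Extract a few standard complexity metrics from one generated solution."""
    cyclomatic = sum(block.complexity for block in cc_visit(source))  # total cyclomatic complexity
    halstead = h_visit(source).total                                  # aggregate Halstead report
    return [cyclomatic, halstead.volume, halstead.difficulty, halstead.effort]

def fit_correctness_model(samples: list[tuple[str, bool]]) -> LogisticRegression:
    """samples: (generated_source, passed_all_tests) pairs collected beforehand."""
    X = np.array([complexity_features(src) for src, _ in samples])
    y = np.array([int(passed) for _, passed in samples])
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Coefficient magnitudes indicate which metrics are most predictive of correctness.
    return model
```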
📝 Abstract
Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very little work attempts to understand the characteristics of the generated code and leverage them to improve failed cases. In this paper, we investigate the relationship between code complexity, arguably the most straightforward characteristic of code, and the success of LLM-generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with LLM performance on code generation (i.e., Pass@1). Using logistic regression models, we then identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method in which LLMs are prompted to generate correct code based on the complexity metrics of their previous failed outputs. We validate our approach across multiple benchmarks (i.e., HumanEval, MBPP, LeetCode, and BigCodeBench) and various LLMs (i.e., GPT-4o, GPT-3.5 Turbo, Llama 3.1, and GPT-o3 mini), comparing the results with two baseline methods: (a) zero-shot generation and (b) iterative execution-based feedback without our code complexity insights. Experimental results show that our approach yields notable improvements, particularly with a smaller LLM (GPT-3.5 Turbo), where, for example, Pass@1 increased by 35.71% compared to the baseline's improvement of 12.5% on the HumanEval dataset. We further extend our experiments to BigCodeBench and integrate the method with the Reflexion code-generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07% (GPT-o3 mini). These results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.
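As a rough illustration of the iterative feedback loop described in the abstract, the sketch below re-prompts a model with complexity metrics extracted from its previous failed attempt. The helpers `generate` and `run_tests` and the prompt wording are hypothetical placeholders; the paper's actual prompts, metric selection, and Reflexion integration are not reproduced here.

```python
# Sketch of complexity-guided iterative feedback.
# Hypothetical helpers: `generate` wraps an LLM call, `run_tests` executes the
# benchmark's unit tests against the candidate solution.
from radon.complexity import cc_visit

def complexity_feedback(source: str) -> str:
    """Summarize complexity metrics of a failed attempt as natural-language feedback."""
    total_cc = sum(block.complexity for block in cc_visit(source))
    return (f"Your previous solution failed its tests and has a total cyclomatic "
            f"complexity of {total_cc}. Rewrite it as a simpler, correct solution.")

def complexity_guided_generation(task_prompt: str, generate, run_tests, rounds: int = 3):
    prompt = task_prompt
    code = ""
    for _ in range(rounds):
        code = generate(prompt)
        if run_tests(code):  # stop at the first attempt that passes all tests
            return code
        # Failure: append complexity-based feedback and try again.
        prompt = f"{task_prompt}\n\n{complexity_feedback(code)}"
    return code  # best effort after the final round
```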