Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness

📅 2025-08-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code generation benchmarks (e.g., HumanEval, MBPP) evaluate only functional correctness, neglecting critical quality dimensions such as security, reliability, readability, and maintainability. To address this gap, we propose a static analysis-driven iterative prompting framework that systematically integrates multidimensional quality feedback into LLM-based code generation. Our method employs Bandit and Pylint to detect violations across these dimensions; targeted repair prompts built from the analysis results then guide GPT-4o to regenerate the code, closing the optimization loop. Experiments demonstrate that after ten iterations, security vulnerabilities decrease by 67%, readability violations drop from over 80% to 11%, and reliability warnings decline by 78%. This work advances beyond correctness-only evaluation paradigms by introducing a scalable, quality-aware methodology for controllable optimization of LLM-generated code, systematically enforcing non-functional requirements in generative code synthesis.

📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities in code generation, achieving high scores on benchmarks such as HumanEval and MBPP. However, these benchmarks primarily assess functional correctness and neglect broader dimensions of code quality, including security, reliability, readability, and maintainability. In this work, we systematically evaluate the ability of LLMs to generate high-quality code across multiple dimensions using the PythonSecurityEval benchmark. We introduce an iterative static analysis-driven prompting algorithm that leverages Bandit and Pylint to identify and resolve code quality issues. Our experiments with GPT-4o show substantial improvements: security issues reduced from >40% to 13%, readability violations from >80% to 11%, and reliability warnings from >50% to 11% within ten iterations. These results demonstrate that LLMs, when guided by static analysis feedback, can significantly enhance code quality beyond functional correctness.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM-generated code quality beyond correctness
Addressing security, reliability, and readability issues in generated code
Reducing static analysis violations through feedback-enhanced generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative static analysis-driven prompting algorithm
Leveraging Bandit and Pylint tools
Resolving code quality issues systematically
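The closed loop described above (analyze, prompt, regenerate, repeat) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `call_llm` callback is a placeholder for a GPT-4o API call, and the repair-prompt wording is invented for this sketch. Bandit and Pylint are invoked through their real CLIs with JSON output.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def run_static_analysis(code: str) -> list[str]:
    """Run Bandit (security) and Pylint (readability/reliability) on a
    code snippet and return a flat list of human-readable findings."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "snippet.py"
        src.write_text(code)
        findings = []
        # Bandit emits a JSON report; each result has a test_id (B###)
        # and an issue_text describing the security problem.
        out = subprocess.run(
            ["bandit", "-f", "json", "-q", str(src)],
            capture_output=True, text=True,
        )
        for issue in json.loads(out.stdout or "{}").get("results", []):
            findings.append(f"[security] {issue['test_id']}: {issue['issue_text']}")
        # Pylint's JSON output is a list of messages with a type
        # (convention/warning/error), a message-id, and a message.
        out = subprocess.run(
            ["pylint", "--output-format=json", str(src)],
            capture_output=True, text=True,
        )
        for msg in json.loads(out.stdout or "[]"):
            findings.append(f"[{msg['type']}] {msg['message-id']}: {msg['message']}")
        return findings


def build_repair_prompt(code: str, findings: list[str]) -> str:
    """Turn static-analysis findings into a targeted repair prompt
    (wording here is illustrative, not the paper's actual prompt)."""
    bullet_list = "\n".join(f"- {f}" for f in findings)
    return (
        "Revise the following Python code to resolve every issue listed, "
        "without changing its functional behaviour.\n\n"
        f"Issues:\n{bullet_list}\n\nCode:\n{code}"
    )


def refine(code: str, call_llm, max_iters: int = 10) -> str:
    """Closed-loop refinement: analyze, build a repair prompt, ask the
    LLM to regenerate, and repeat until clean or the budget runs out."""
    for _ in range(max_iters):
        findings = run_static_analysis(code)
        if not findings:
            break  # clean under both analyzers
        code = call_llm(build_repair_prompt(code, findings))
    return code
```

The ten-iteration budget mirrors the paper's reported setup; in practice the loop usually terminates early once both analyzers report no remaining violations.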