A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code generated by large language models (LLMs) frequently reproduces anti-patterns, introducing code smells that impair readability and maintainability; however, the root causes and evolutionary patterns of these smells remain poorly understood. Method: We propose the first causality-aware framework for assessing code smell propensity in LLM-generated code, centered on the Propensity Smelly Score (PSC), a novel metric quantifying how generation strategy, model scale, architecture, and prompt design systematically influence structural quality. Contribution/Results: Through empirical modeling and a developer user study, we validate that PSC effectively captures structural defects and supports actionable development decisions. We further demonstrate that refined prompt engineering and architectural choices significantly reduce smell incidence. This work advances a quality-aware paradigm for LLM-based code generation and provides both theoretical foundations and practical tools for trustworthy AI-assisted programming.

📝 Abstract
Recent advances in large language models (LLMs) have accelerated their adoption in software engineering contexts. However, concerns persist about the structural quality of the code they produce. In particular, LLMs often replicate poor coding practices, introducing code smells (i.e., patterns that hinder readability, maintainability, or design integrity). Although prior research has examined the detection or repair of smells, we still lack a clear understanding of how and when these issues emerge in generated code. This paper addresses this gap by systematically measuring, explaining and mitigating smell propensity in LLM-generated code. We build on the Propensity Smelly Score (PSC), a probabilistic metric that estimates the likelihood of generating particular smell types, and establish its robustness as a signal of structural quality. Using PSC as an instrument for causal analysis, we identify how generation strategy, model size, model architecture and prompt formulation shape the structural properties of generated code. Our findings show that prompt design and architectural choices play a decisive role in smell propensity and motivate practical mitigation strategies that reduce its occurrence. A user study further demonstrates that PSC helps developers interpret model behavior and assess code quality, providing evidence that smell propensity signals can support human judgement. Taken together, our work lays the groundwork for integrating quality-aware assessments into the evaluation and deployment of LLMs for code.
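The abstract describes PSC as a probabilistic metric estimating the likelihood that a model generates particular smell types. The paper's exact formula is not reproduced here, so the sketch below is an illustrative assumption: it aggregates hypothetical per-smell-type probabilities (e.g. from a smell detector run over generated samples) into a single propensity score by averaging.

```python
# Hedged sketch of a PSC-style aggregation. The per-type probabilities and
# the mean-based aggregation are illustrative assumptions, not the paper's
# actual definition of the Propensity Smelly Score.

def propensity_smelly_score(smell_probs: dict[str, float]) -> float:
    """Aggregate per-smell-type probabilities into one propensity score.

    smell_probs maps a smell type (e.g. 'long_method') to the estimated
    probability that generated code exhibits that smell.
    """
    if not smell_probs:
        return 0.0
    return sum(smell_probs.values()) / len(smell_probs)

# Example with hypothetical detector outputs for three common smell types:
probs = {"long_method": 0.6, "duplicated_code": 0.2, "god_class": 0.1}
print(round(propensity_smelly_score(probs), 3))  # 0.3
```

A per-type breakdown (rather than only the aggregate) would match the abstract's claim that PSC estimates the likelihood of *particular* smell types; the aggregate here is just a convenient single signal.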
Problem

Research questions and friction points this paper is trying to address.

Measuring structural quality issues in LLM-generated code
Explaining causes of code smells through generation parameters
Developing mitigation strategies for smell-prone generated code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using causal analysis to explain code smell origins
Proposing Propensity Smelly Score as quality metric
Developing mitigation strategies through prompt optimization
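The last bullet points to mitigation through prompt optimization. A minimal sketch of that idea, assuming (hypothetically) that the mitigation takes the form of appending explicit structural-quality constraints to the generation prompt; the constraint wording below is illustrative, not the paper's template:

```python
# Hedged sketch of prompt-level smell mitigation: wrapping a raw code
# generation task with quality constraints intended to lower smell
# propensity. The constraint text is an illustrative assumption.

QUALITY_CONSTRAINTS = (
    "Constraints: keep functions short and single-purpose, "
    "avoid duplicated logic, and use descriptive names."
)

def smell_aware_prompt(task: str) -> str:
    """Append smell-mitigation instructions to a code-generation task."""
    return f"{task}\n\n{QUALITY_CONSTRAINTS}"

print(smell_aware_prompt("Write a Python function that parses a CSV file."))
```

In a PSC-driven workflow, such prompt variants could be compared by their measured smell propensity and the lowest-scoring variant kept.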