AI Summary
This study addresses the lack of systematic investigation into the readability of code generated by large language models (LLMs), a gap that hinders their practical adoption and maintainability in software development. The authors propose the first multidimensional readability quantification model integrating textual, structural, programmatic, and visual features. Leveraging large-scale datasets from WoC and LeetCode, they conduct a comprehensive evaluation across 5,869 scenarios involving mainstream LLMs and perform prompt ablation experiments to identify key influencing factors. Their findings reveal that while LLM-generated code exhibits overall readability comparable to human-written code, it manifests distinct problematic patterns. Specifically, function signatures, constraint specifications, and style descriptions emerge as critical prompt elements affecting readability, whereas prompt engineering as a whole demonstrates limited efficacy, highlighting potential technical debt risks.
Abstract
As Large Language Models (LLMs) transform software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Because LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore conduct a systematic investigation into the readability of LLM-generated code. To quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on this model, we evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases, including World of Code (WoC) and LeetCode. We find that current LLMs produce code whose overall readability is comparable to that of human-written code, but which displays distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code and find that function signatures, constraints, and style descriptions are the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on the one hand, LLM-generated code is at least comparable to human-written code in readability, which supports its potential for systematic integration into software workflows from a non-functional perspective. On the other hand, the distinct readability issue patterns and the limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.