🤖 AI Summary
Prior work offers little empirical understanding of how AI coding agents (e.g., Codex, Claude Code, Cursor) refactor code in real-world settings.
Method: This paper presents the first large-scale empirical study of agentic refactoring, analyzing 14,988 commits across 12,256 pull requests in open-source Java projects from the AIDev dataset. We integrate automated code quality metrics with manual classification of coding intents to systematically identify AI-driven refactoring activities and their underlying motivations.
Contribution/Results: We find that 26.1% of AI-generated commits explicitly target refactoring—predominantly localized consistency improvements such as variable renaming and type adjustments—motivated primarily by maintainability and readability. Refactoring significantly improves structural quality: class size shrinks (median Class LOC Δ = −15.25) and cyclomatic complexity is reduced. This study provides the first large-scale, project-based empirical evidence and behavioral characterization of AI-assisted refactoring, advancing sustainable software development through empirically grounded insights.
📝 Abstract
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. Agents have become active participants in refactoring, a cornerstone of sustainable software development aimed at improving internal code quality without altering observable behavior. Despite their increasing adoption, there is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality. To address this empirical gap, we present a large-scale study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,988 commits derived from the AIDev dataset. Our empirical analysis shows that refactoring is a common and intentional activity in this development paradigm, with agents explicitly targeting refactoring in 26.1% of commits. Analysis of refactoring types reveals that agentic efforts are dominated by low-level, consistency-oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high-level design changes common in human refactoring. Additionally, the motivations behind agentic refactoring focus overwhelmingly on internal quality concerns, led by maintainability (52.5%) and readability (28.1%). Furthermore, quantitative evaluation of code quality metrics shows that agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium-level changes, reducing class size and complexity (e.g., Class LOC median $\Delta = -15.25$).
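To make the dominant refactoring types concrete, here is an illustrative sketch (not taken from the studied dataset) of Change Variable Type and Rename Variable applied to a small Java method. The class and method names are hypothetical; the point is that the edit improves naming and typing consistency while leaving observable behavior unchanged, which is the defining property of a refactoring.

```java
import java.util.ArrayList;
import java.util.List;

public class RefactoringExample {
    // Before: vague identifiers (`l`, `s`) and a concrete type in the signature.
    static int sumBefore(ArrayList<Integer> l) {
        int s = 0;
        for (int x : l) s += x;
        return s;
    }

    // After: Rename Variable (l -> values, s -> total) and
    // Change Variable Type (ArrayList -> the List interface).
    // The computation itself is untouched.
    static int sumAfter(List<Integer> values) {
        int total = 0;
        for (int value : values) total += value;
        return total;
    }

    public static void main(String[] args) {
        ArrayList<Integer> data = new ArrayList<>(List.of(1, 2, 3));
        // Behavior is preserved: both variants return the same result.
        System.out.println(sumBefore(data) == sumAfter(data)); // prints: true
    }
}
```

Such edits are low-level and localized, matching the paper's observation that agents favor consistency-oriented changes over high-level design restructuring.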