🤖 AI Summary
This study addresses the lack of systematic understanding of software testing practices in human–AI collaborative development, particularly regarding differences in testing behavior at the pull request (PR) level. Leveraging the large-scale AIDev dataset, we conduct a comparative analysis of 6,582 human–AI collaborative PRs against 3,122 purely human-authored PRs, examining testing frequency, types of testing-related changes, and test quality. Our findings reveal that while the likelihood of including tests is comparable (42.9% vs. 40.0%), AI involvement is associated with a substantially greater extent of testing: collaborative PRs contain nearly double the proportion of test code and show a stronger tendency to introduce new tests. However, no meaningful difference is observed in test quality, as measured by test smell detection. This work provides the first systematic characterization of how AI agents influence software testing practices in real-world collaborative settings.
📝 Abstract
AI-based coding agents are increasingly integrated into software development workflows, collaborating with developers to create pull requests (PRs). Despite their growing adoption, the role of human–agent collaboration in software testing remains poorly understood. This paper presents an empirical study of 6,582 human–agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset. We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes (code-and-test co-evolution vs. test-focused), and (iii) testing quality, measured by test smells. Our findings reveal that, although the likelihood of including tests is comparable (42.9% for HAPRs vs. 40.0% for HPRs), HAPRs exhibit a larger extent of testing, nearly doubling the test-to-source line ratio found in HPRs. While the share of test-focused changes is comparable across both groups, HAPRs are more likely to add new tests during co-evolution (OR=1.79), whereas HPRs prioritize modifying existing tests. Finally, although some test smell categories differ statistically, negligible effect sizes suggest no meaningful differences in quality. These insights provide the first characterization of how human–agent collaboration shapes testing practices.
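For readers unfamiliar with the statistic, the OR=1.79 reported above is an odds ratio comparing the odds that a co-evolution HAPR adds new tests against the odds for an HPR. A minimal sketch of the computation, using hypothetical 2×2 counts chosen for illustration (not the paper's actual data):

```python
# Odds ratio from a hypothetical 2x2 contingency table.
# All counts below are illustrative, NOT taken from the AIDev dataset.
hapr_adds_new_tests = 1790      # hypothetical HAPRs that add new tests
hapr_no_new_tests = 1000        # hypothetical HAPRs that do not
hpr_adds_new_tests = 1000       # hypothetical HPRs that add new tests
hpr_no_new_tests = 1000         # hypothetical HPRs that do not

# Odds = (events) / (non-events) within each group.
odds_hapr = hapr_adds_new_tests / hapr_no_new_tests
odds_hpr = hpr_adds_new_tests / hpr_no_new_tests

# Odds ratio > 1 means HAPRs are more likely to add new tests.
odds_ratio = odds_hapr / odds_hpr
print(f"OR = {odds_ratio:.2f}")  # → OR = 1.79
```

An OR of 1.79 means the odds of adding new tests in a co-evolution HAPR are 1.79 times the corresponding odds in an HPR.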