🤖 AI Summary
This paper investigates the volume of code changes, termed "review effort", required for approval of GitLab merge requests (MRs), formally defined as the number of lines modified *after* initial submission.
Method: Leveraging a dataset of over 23,600 real-world MRs, we develop an interpretable machine learning model integrating textual features (e.g., description quality), code complexity metrics, developer experience, historical review patterns, and branch topology.
Contribution/Results: We find that 71% of MRs require post-submission modifications, with 28% involving ≥200 changed lines. Code complexity, developer experience, and textual features are the strongest predictors of review effort; notably, review effort shows no significant correlation with review duration or reviewer count. Our model achieves AUC scores of 0.84–0.88, demonstrating strong predictive validity. This work establishes a novel, empirically grounded paradigm for quantifying code review cost and optimizing collaborative review practices.
📝 Abstract
Code review (CR) is essential to software development, helping ensure that new code is properly integrated. However, the CR process often involves significant effort, including code adjustments, responses to reviewers, and continued implementation. While past studies have examined CR delays and iteration counts, few have measured effort as the volume of code changes required, and GitLab Merge Requests (MRs) in particular remain underexplored. In this paper, we define and measure CR effort as the amount of code modified after submission, using a dataset of over 23,600 MRs from four GitLab projects. We find that up to 71% of MRs require adjustments after submission, and 28% of these involve changes to more than 200 lines of code. Surprisingly, this effort is not correlated with review time or the number of participants. To better understand and predict CR effort, we train an interpretable machine learning model using metrics across multiple dimensions: text features, code complexity, developer experience, review history, and branching. Our model achieves strong performance (AUC 0.84–0.88) and reveals that complexity, experience, and text features are key predictors. Historical project characteristics also influence current review effort. Our findings highlight the feasibility of using machine learning to explain and anticipate the effort needed to integrate code changes during review.
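The effort metric defined above can be sketched as a small calculation. This is an illustrative assumption of how "lines modified after submission" might be tallied; the dict fields and the `review_effort` helper are hypothetical, not the paper's implementation or GitLab's API schema:

```python
# Hypothetical sketch: computing "review effort" for one MR, defined as the
# number of lines modified after the initial submission.
# The {"additions", "deletions"} records are illustrative, not GitLab's schema.

def review_effort(post_submission_diffs):
    """Sum lines added and deleted across all revisions pushed after
    the MR was first opened."""
    return sum(d["additions"] + d["deletions"] for d in post_submission_diffs)

# Example: two follow-up pushes made in response to reviewer feedback.
followups = [
    {"additions": 120, "deletions": 35},   # first round of fixes
    {"additions": 40,  "deletions": 18},   # second round
]

effort = review_effort(followups)
print(effort)  # 213
# The paper highlights MRs with >= 200 changed lines as high-effort.
high_effort = effort >= 200
```

An MR with no post-submission pushes would score 0, matching the roughly 29% of MRs the paper finds need no adjustment after submission.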