🤖 AI Summary
This study presents the first empirical investigation of how CI/CD configurations evolve in machine learning (ML) projects. To address the limited understanding of how CI/CD configurations co-evolve with ML components, the authors manually analyze 343 commits drawn from 508 open-source ML projects and apply automated clustering to 15,634 CI/CD commits. They propose a taxonomy of 14 co-changes between CI/CD and ML components, develop a clustering tool that identifies frequent CI/CD configuration change patterns, and measure the expertise of the developers who modify CI/CD configurations. Results show that 61.8% of the analyzed commits include a change to the build policy. The analysis also identifies bad practices, including direct inclusion of dependencies, lack of standardized testing frameworks, use of deprecated settings, and reliance on a generic build language. Finally, the expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.
📝 Abstract
The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts have led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions, that enable faster integration and testing for ML projects. Such CI/CD configurations and services require synchronization throughout the life cycle of a project. Several works have discussed how CI/CD configurations and services change during their usage in traditional software systems. However, there is very limited knowledge of how CI/CD configurations and services change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects and devised a taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a CI/CD configuration change clustering tool that identified frequent CI/CD configuration change patterns in 15,634 commits. Furthermore, we measured the expertise of ML developers who modify CI/CD configurations. Based on this analysis, we found that 61.8% of commits include a change to the build policy, and that changes related to performance and maintainability are minimal compared to general open-source projects. Additionally, the co-evolution analysis identified that CI/CD configurations, in many cases, changed unnecessarily due to bad practices such as the direct inclusion of dependencies and a lack of usage of standardized testing frameworks. The change pattern analysis revealed further bad practices, namely the use of deprecated settings and reliance on a generic build language. Finally, our developer expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.