🤖 AI Summary
In telecom continuous deployment, flaky pipeline jobs frequently interrupt releases, yet systematic root-cause analysis and prioritization of these failures remain underexplored. Method: We systematically analyzed 4,511 flaky job failures, identified 46 distinct root-cause categories, and adapted the RFM model (Recency, Frequency, Monetary) from marketing analytics into a transferable framework for classifying and prioritizing flaky failures, an application that is novel in software engineering. Our approach integrates manual root-cause annotation, hierarchical clustering, and multidimensional weighted RFM scoring, augmented by industrial-scale log analysis. Contribution/Results: We distilled 14 high-priority flaky failure categories, characterized their temporal evolution and business impact, and validated the framework in TELUS’s production environment. It significantly improved diagnosis efficiency and fills a critical gap in research on automated flakiness attribution and prioritization.
📝 Abstract
The continuous delivery of modern software requires the execution of many automated pipeline jobs. These jobs ensure the frequent release of new software versions while detecting code problems at an early stage. For TELUS, our industrial partner in the telecommunications field, reliable job execution is crucial to minimize wasted time and streamline Continuous Deployment (CD). In this context, flaky job failures are one of the main issues hindering CD. Prior studies proposed techniques based on machine learning to automate the detection of flaky jobs. While valuable, these solutions are insufficient to address the waste associated with the diagnosis of flaky failures, which remains largely unexplored due to the wide range of underlying causes. This study examines 4,511 flaky job failures at TELUS to identify the different categories of flaky failures and prioritize them based on Recency, Frequency, and Monetary (RFM) measures. We identified 46 flaky failure categories, which we analyzed using clustering and RFM measures to determine 14 priority categories for future research on automated diagnosis and repair. Our findings also provide valuable insights into the evolution and impact of these categories. The identification and prioritization of flaky failure categories using RFM analysis introduces a novel approach that can be used in other contexts.
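To make the idea of RFM-based prioritization concrete, the following is a minimal sketch and not the study's actual procedure. It assumes each failure record carries a category label, a failure date, and a cost proxy such as wasted minutes. The `FlakyFailure` type, the category names, the cost values, and the rank-sum rule are all hypothetical choices made for illustration; the abstract does not specify how the three measures are defined or combined.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FlakyFailure:
    category: str    # hypothetical category label
    failed_on: date  # date of the flaky failure
    cost: float      # hypothetical cost proxy, e.g. wasted minutes


def rfm_by_category(failures, today):
    """Return {category: (recency_days, frequency, monetary)}."""
    out = {}
    for f in failures:
        days = (today - f.failed_on).days
        r, n, m = out.get(f.category, (days, 0, 0.0))
        # Recency: days since the most recent failure in the category.
        # Frequency: number of failures. Monetary: summed cost proxy.
        out[f.category] = (min(r, days), n + 1, m + f.cost)
    return out


def prioritize(rfm):
    """Order categories by the sum of their per-dimension ranks (assumed rule)."""
    cats = list(rfm)

    def ranks(key, reverse):
        ordered = sorted(cats, key=key, reverse=reverse)
        return {c: i for i, c in enumerate(ordered)}

    r_rank = ranks(lambda c: rfm[c][0], reverse=False)  # most recent first
    f_rank = ranks(lambda c: rfm[c][1], reverse=True)   # most frequent first
    m_rank = ranks(lambda c: rfm[c][2], reverse=True)   # most costly first
    return sorted(cats, key=lambda c: r_rank[c] + f_rank[c] + m_rank[c])


# Made-up records, for illustration only.
failures = [
    FlakyFailure("network_timeout", date(2024, 3, 9), 30.0),
    FlakyFailure("network_timeout", date(2024, 3, 10), 45.0),
    FlakyFailure("runner_out_of_disk", date(2024, 1, 5), 20.0),
    FlakyFailure("registry_unavailable", date(2024, 3, 1), 120.0),
]
rfm = rfm_by_category(failures, today=date(2024, 3, 11))
ranking = prioritize(rfm)
print(ranking)
```

A rank sum weights the three dimensions equally. A weighted sum or cluster-based grouping, as the summary above mentions, would be alternative ways to combine them.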