🤖 AI Summary
In software testing, flaky tests—those exhibiting non-deterministic pass/fail behavior—often fail in clusters due to shared root causes (e.g., network instability, transient external dependencies), forming “systemic flakiness,” which substantially increases debugging and repair costs. This phenomenon challenges the conventional assumption that flakiness is isolated and test-specific.
Method: We introduce the concept of systemic flakiness and empirically analyze 24 open-source Java projects, revealing that 75% of flaky tests belong to clusters of co-failing tests (average cluster size: 13.5 tests). To identify such clusters efficiently, we propose a lightweight ML model that leverages static test-distance features, combined with bottom-up (agglomerative) clustering and manual stack-trace validation.
Contribution/Results: Evaluated on 810 flaky tests, our approach achieves high detection accuracy and enables elimination of thousands of redundant test executions. This work establishes a root-cause–driven paradigm for batch diagnosis and remediation of flakiness, shifting focus from individual test fixes to systemic root-cause resolution.
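To make the "static test-distance features" idea concrete, here is a minimal, hypothetical sketch of one such feature: the Jaccard distance between the identifier tokens of two test cases. The token sets, test names, and the intuition that token-similar tests share root causes are illustrative assumptions for this sketch, not the paper's actual feature set.

```python
# Hypothetical sketch: a static distance between two tests, here the
# Jaccard distance over identifier tokens. A model could use features
# like this to predict whether two tests share a flakiness root cause
# without rerunning the suite. Token sets below are invented examples.
def token_jaccard_distance(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # two empty token sets: define distance as zero
    return 1.0 - len(a & b) / len(a | b)

# Two tests exercising the same HTTP client are statically "close";
# a scheduler test is statically "far" from both.
test_a = ["HttpClient", "connect", "timeout", "retry", "assertEquals"]
test_b = ["HttpClient", "connect", "timeout", "retry", "assertTrue"]
test_c = ["Scheduler", "cron", "parse", "assertEquals"]

print(token_jaccard_distance(test_a, test_b))  # smaller distance
print(token_jaccard_distance(test_a, test_c))  # larger distance
```

A classifier trained on pairs of tests could consume distances like this (plus other static signals) and predict co-failure, replacing thousands of dynamic test-suite runs with a cheap static analysis.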
📝 Abstract
Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We discovered that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This suggests that developers can reduce repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. It contains 810 flaky tests, which we leveraged to perform a mixed-method empirical analysis of co-occurring flaky test failures. Systemic flakiness is significant and widespread. We performed agglomerative clustering of flaky tests based on their failure co-occurrence, finding that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, we demonstrated a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness.
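The clustering step described above can be sketched in a few lines. The following is an illustrative toy, not the paper's exact pipeline: each flaky test is mapped to the set of test-suite runs in which it failed, distance is one minus the Jaccard similarity of those failure sets, and clusters grow bottom-up via single-linkage merging. The toy failure histories and the 0.5 distance threshold are assumptions made for this sketch.

```python
# Illustrative sketch of agglomerative clustering by failure co-occurrence.
# failures maps each test name to the set of run IDs in which it failed;
# tests that tend to fail in the same runs end up in the same cluster.

def jaccard_distance(a, b):
    # 0.0 when the failure sets coincide, 1.0 when they are disjoint.
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def agglomerate(failures, threshold=0.5):
    # Start with every test in its own singleton cluster.
    clusters = [{name} for name in failures]

    def dist(c1, c2):
        # Single linkage: distance of the closest pair across clusters.
        return min(jaccard_distance(failures[x], failures[y])
                   for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        if dist(clusters[i], clusters[j]) > threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] |= clusters.pop(j)
    return clusters

# Toy histories over runs 1..6: testA and testB co-fail (e.g. a shared
# flaky network dependency), while testC fails on its own.
failures = {
    "testA": {1, 2, 5},
    "testB": {1, 2, 5, 6},
    "testC": {3},
}
print(agglomerate(failures))  # testA and testB cluster; testC stays alone
```

In the study itself this analysis was run over 10,000 recorded suite runs; the static-distance models mentioned in the abstract aim to approximate these clusters without that dynamic data.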