🤖 AI Summary
Understanding long-term failure patterns and their root causes in large-scale system-level test suites remains an open challenge, particularly for production operating systems undergoing continuous evolution.
Method: We conduct a longitudinal empirical study of NetBSD’s automated, continuous, virtualization-based test suite, which has been in operation from its initial rollout in the early 2010s through late 2025 and currently covers over ten thousand individual test cases. The study quantifies test suite growth, the stability of test case failures, build failures, installation failures, and failures of the test suite to complete.
Contribution/Results: We find that the test suite has grown continuously while test case failures have remained stable overall, apart from shorter periods with more frequent failures; build failures, installation failures, and incomplete test runs follow a similar pattern. Code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures: some periods show larger effects, particularly for kernel modifications, but the effects are small on average. These exploratory observations suggest that failures are only weakly coupled to code changes at this time scale, and they contribute to efforts to draw conclusions from large-scale, evolving software test suites.
📝 Abstract
The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The observation period spans from the initial rollout of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously and currently covers over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked by more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, particularly with respect to kernel modifications, the effects are small on average. Although exploratory, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.