🤖 AI Summary
Problem: Cloud service reliability assessment lacks empirical validation from diverse, real-world sources, and existing studies do not characterize cross-layer fault propagation or the efficacy of fault-tolerance mechanisms. Method: We construct the first open-source cloud service availability data warehouse, integrating user-reported incidents with operator reports across web services, cloud platforms, and online gaming. We propose a dual-perspective (user-side and operations-side), cross-layer methodology for fusing fault data, and we develop a reproducible simulation framework, driven by real fault traces, to quantify the performance trade-offs of checkpointing and retry strategies under heterogeneous failure modes. Contribution/Results: Our analysis shows that high-level services exhibit lower observed failure rates than the underlying infrastructure, likely because they employ fault-tolerance techniques. We publicly release an annotated dataset (GitHub) and analytical tooling, establishing an empirical foundation and methodological framework for cloud reliability modeling and fault-tolerance evaluation.
📝 Abstract
Cloud services are critical to society, yet their reliability is poorly understood. Toward addressing this problem, we propose a standard repository for cloud uptime data. We populate this repository with data we collect, consisting of failure reports from users and operators of cloud services, web services, and online games. The multiple vantage points help reduce bias from individual users and operators. We compare our new data to existing failure data from the Failure Trace Archive and the Google cluster trace. For the data we collect, we analyze the mean time between failures (MTBF) and mean time to repair (MTTR), time patterns, failure severity, and the symptoms reported by users and by operators. We observe that high-level, user-facing services fail less often than low-level infrastructure services, likely because they use fault-tolerance techniques. We use simulation-based experiments to demonstrate the impact of different failure traces on the performance of checkpointing and retry mechanisms. We release the data, along with the analysis and simulation tools, as open-source artifacts available at https://github.com/atlarge-research/cloud-uptime-archive.
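As an illustration of the MTBF and MTTR metrics mentioned above, the following is a minimal sketch, not the archive's own tooling. It assumes one common convention (MTBF as total uptime divided by the number of failures, MTTR as total downtime divided by the number of failures) and non-overlapping failures inside a single observation window. The `Failure` record, the `mtbf_mttr` function, and the toy trace are hypothetical and do not reflect the repository's data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical failure record; times are hours since the window start."""
    start: float  # when the outage began
    end: float    # when service was restored


def mtbf_mttr(failures, window_hours):
    """Return (MTBF, MTTR) in hours.

    Assumes the failures do not overlap and all lie within the window.
    MTBF = total uptime / number of failures
    MTTR = total downtime / number of failures
    """
    if not failures:
        raise ValueError("need at least one failure to compute MTBF/MTTR")
    downtime = sum(f.end - f.start for f in failures)
    uptime = window_hours - downtime
    n = len(failures)
    return uptime / n, downtime / n


# Toy trace (made-up numbers): three outages in a 120-hour window.
trace = [Failure(10.0, 12.0), Failure(50.0, 51.0), Failure(90.0, 93.0)]
mtbf, mttr = mtbf_mttr(trace, window_hours=120.0)

# Steady-state availability implied by the two means.
availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.1f} h, MTTR={mttr:.1f} h, availability={availability:.3f}")
```

Under this convention the implied availability, MTBF / (MTBF + MTTR), equals the fraction of the window during which the service was up, which gives a quick consistency check on a trace.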