Cast: Automated Resilience Testing for Production Cloud Service Systems

📅 2026-02-01
🤖 AI Summary
This work addresses the challenge of validating system resilience in microservice architectures, where distributed complexity makes traditional manual testing inadequate. The authors propose Cast, an end-to-end automated resilience testing framework that replays real production traffic and injects application-level faults within a three-phase pipeline (startup, fault injection, and recovery), coupled with a multi-faceted oracle that verifies resilience automatically. The framework uses a complexity-driven strategy to prune redundant tests and prioritize those that target the most critical execution paths. Deployed in Huawei Cloud for over eight months and analyzed on four large-scale applications with millions of traces, the approach identified 137 potential resilience vulnerabilities (89 confirmed by developers) and achieved 90% coverage on a benchmark of 48 reproduced bugs.

📝 Abstract
The distributed nature of microservice architecture introduces significant resilience challenges. Traditional testing methods, limited by extensive manual effort and oversimplified test environments, fail to capture production system complexity. To address these limitations, we present Cast, an automated, end-to-end framework for microservice resilience testing in production. It achieves high test fidelity by replaying production traffic against a comprehensive library of application-level faults to exercise internal error-handling logic. To manage the combinatorial test space, Cast employs a complexity-driven strategy to systematically prune redundant tests and prioritize high-value tests targeting the most critical service execution paths. Cast automates the testing lifecycle through a three-phase pipeline (i.e., startup, fault injection, and recovery) and uses a multi-faceted oracle to automatically verify system resilience against nuanced criteria. Deployed in Huawei Cloud for over eight months, Cast has been adopted by many service teams to proactively address resilience vulnerabilities. Our analysis on four large-scale applications with millions of traces reveals 137 potential vulnerabilities, with 89 confirmed by developers. To further quantify its performance, Cast is evaluated on a benchmark set of 48 reproduced bugs, achieving a high coverage of 90%. The results show that Cast is a practical and effective solution for systematically improving the reliability of industrial microservice systems.
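
The abstract does not spell out how the complexity-driven pruning and prioritization work, so the following is only a minimal sketch of the general idea under stated assumptions, not Cast's actual algorithm. It assumes that a test is a replayed trace paired with one injected fault, that two tests are redundant when they share the same service path, fault target, and fault type, and that path complexity can be approximated by the number of distinct services on the path. The names `FaultTest`, `complexity`, and `prune_and_prioritize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultTest:
    """One candidate test: a replayed trace plus a fault to inject (hypothetical model)."""
    trace_id: str
    path: tuple[str, ...]  # ordered services the trace passes through
    target: str            # service where the fault is injected
    fault: str             # application-level fault type, e.g. "timeout"


def complexity(path: tuple[str, ...]) -> int:
    """Assumed proxy for path complexity: the number of distinct services."""
    return len(set(path))


def prune_and_prioritize(cases: list[FaultTest]) -> list[FaultTest]:
    # Prune: under the redundancy assumption above, traces with the same
    # path, target, and fault exercise the same error-handling logic,
    # so only the first one seen is kept.
    unique: dict[tuple, FaultTest] = {}
    for case in cases:
        unique.setdefault((case.path, case.target, case.fault), case)
    # Prioritize: most complex paths first. sorted() is stable, so tests
    # with equal complexity keep their original relative order.
    return sorted(unique.values(), key=lambda c: complexity(c.path), reverse=True)


if __name__ == "__main__":
    cases = [
        FaultTest("t1", ("gateway", "orders", "payments"), "payments", "timeout"),
        FaultTest("t2", ("gateway", "orders", "payments"), "payments", "timeout"),
        FaultTest("t3", ("gateway", "catalog"), "catalog", "exception"),
        FaultTest("t4", ("gateway", "orders", "payments", "ledger"), "ledger", "timeout"),
    ]
    for case in prune_and_prioritize(cases):
        print(case.trace_id, complexity(case.path))
```

In this toy input, `t2` duplicates `t1` and is dropped, and the remaining tests are ordered by the assumed complexity score. A production system would likely weigh richer signals (call depth, fan-out, traffic volume), which the abstract does not describe.
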
Problem

Research questions and friction points this paper is trying to address.

microservice resilience
production testing
fault injection
system reliability
distributed systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

resilience testing
microservice architecture
production traffic replay
fault injection
automated testing

Authors

Zhuangbin Chen
Assistant Professor, School of Software Engineering, Sun Yat-sen University
Software Engineering, Distributed Systems, Cloud Computing, LLM Systems

Zhiling Deng
School of Software Engineering, Sun Yat-sen University

Kaiming Zhang
School of Software Engineering, Sun Yat-sen University

Yang Liu
School of Software Engineering, Sun Yat-sen University

Cheng Cui
BUAA
deep learning, network design, OCR, mllm

Jinfeng Zhong
Huawei Cloud

Zibin Zheng
IEEE Fellow, Highly Cited Researcher, Sun Yat-sen University, China
Blockchain, Smart Contract, Services Computing, Software Reliability