CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Self-sustaining cascading failures in distributed systems arise from multi-step fault propagation chains, each step triggered under its own conditions, which makes them difficult to expose before deployment. CSnake, a fault injection framework, addresses this with *causal stitching*: it causally links single-fault injections from different tests to simulate complex propagation chains. The framework combines three components: fault causality analysis (FCA), a counterfactual comparison of each fault injection run against its profile run; a three-phase test budget allocation protocol that prioritizes faults with unique and diverse causal consequences; and a low-overhead local compatibility check on the path constraints of connected fault propagations. Across five distributed systems, CSnake detected 15 bugs that cause self-sustaining cascading failures; five have been confirmed and two have been fixed.

📝 Abstract

Recent studies have revealed that self-sustaining cascading failures in distributed systems frequently lead to widespread outages, which are challenging to contain and recover from. Existing failure detection techniques struggle to expose such failures prior to deployment, because triggering them typically requires a complex combination of specific conditions. This challenge stems from the inherent nature of cascading failures: they typically involve a sequence of fault propagations, each activated by distinct conditions. This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. To identify these chains, CSnake designs a counterfactual causality analysis of fault propagations, called fault causality analysis (FCA). FCA compares the execution trace of a fault injection run with its corresponding profile run (i.e., the same test without the injection) and identifies any additional faults triggered, which are considered to have a causal relationship with the injected fault. To address the large search space of fault and workload combinations, CSnake employs a three-phase test budget allocation protocol that prioritizes faults with unique and diverse causal consequences, increasing the likelihood of uncovering conditional fault propagations. Furthermore, to avoid incorrectly connecting fault propagations from workloads with incompatible conditions, CSnake performs a local compatibility check that approximately checks, with low overhead, the compatibility of the path constraints associated with connected fault propagations. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems, five of which have been confirmed and two of which have been fixed.
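
The FCA comparison described in the abstract can be illustrated with a minimal sketch. It is not CSnake's implementation: it assumes a trace has already been reduced to a sequence of fault identifiers, and the names `Fault`, `fault_causality_analysis`, and the sample traces are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumption: a fault is identified by a plain string such as "B".
Fault = str


def fault_causality_analysis(
    injected: Fault,
    profile_trace: Iterable[Fault],
    injection_trace: Iterable[Fault],
) -> Set[Tuple[Fault, Fault]]:
    """Return (cause, effect) pairs for faults seen only in the injection run."""
    baseline = set(profile_trace)
    # Faults present with the injection but absent from the profile run,
    # excluding the injected fault itself, are treated as its consequences.
    additional = set(injection_trace) - baseline - {injected}
    return {(injected, effect) for effect in additional}


# Hypothetical traces: fault "X" occurs even without the injection,
# so it is not attributed to the injected fault "A".
edges = fault_causality_analysis(
    injected="A",
    profile_trace=["X"],
    injection_trace=["X", "A", "B", "C"],
)
```

Each returned pair is one single-step fault propagation; pairs collected from different tests are the raw material that causal stitching later links into longer chains.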

Problem

Research questions and friction points this paper is trying to address.

Detects self-sustaining cascading failures in distributed systems

Simulates complex fault propagation chains via causal stitching

Addresses large search space of fault and workload combinations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal stitching links single-fault injections across tests

Counterfactual causality analysis identifies fault propagation chains

Three-phase allocation prioritizes faults with diverse consequences
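
As a rough illustration of causal stitching, the sketch below merges single-step propagations observed in different tests into one graph and looks for a chain that loops back on itself. Treating a self-sustaining failure as a cycle in that graph is an assumption made for this example, and the local compatibility check is deliberately omitted; `stitch`, `find_cycle`, and the test names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (cause fault, effect fault)


def stitch(edges_per_test: Dict[str, Set[Edge]]) -> Dict[str, Set[str]]:
    """Merge propagations from separate tests into one causal graph."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for edges in edges_per_test.values():
        for cause, effect in edges:
            graph[cause].add(effect)
    return dict(graph)


def find_cycle(graph: Dict[str, Set[str]]) -> List[str]:
    """Depth-first search; return one cyclic chain of faults, or [] if none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = defaultdict(int)  # unseen nodes default to WHITE
    path: List[str] = []

    def dfs(node: str) -> List[str]:
        color[node] = GREY
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GREY:  # back edge: the chain re-triggers itself
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return []

    for start in sorted(graph):
        if color[start] == WHITE:
            found = dfs(start)
            if found:
                return found
    return []


# Hypothetical propagations, each observed in a different test.
observed = {
    "test_1": {("A", "B")},
    "test_2": {("B", "C")},
    "test_3": {("C", "A")},
}
cycle = find_cycle(stitch(observed))
```

In this example no single test exhibits the whole loop; it only appears once the three propagations are stitched together. A fuller version would drop an edge pair before stitching whenever the path constraints of the two propagations are incompatible.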

🔎 Similar Papers

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis

2024-06-27 · arXiv.org · Citations: 1
