Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Microservice systems are hard to diagnose. Strong inter-component interactions and independent deployment make faults difficult to localize accurately and efficiently, which undermines system reliability. To address this, we conduct a systematic literature review of 98 publications spanning 2003–2024, combined with a qualitative analysis of the field. We organize the literature along problem definitions, architectural paradigms, diagnostic dimensions, and evaluation criteria. We also compile publicly available datasets, toolkits, and evaluation metrics, and offer guidance on selecting and validating techniques. The study maps the current landscape of microservice fault diagnosis, identifies open bottlenecks and future directions, and aims to help practitioners adopt these techniques in industrial settings.

📝 Abstract
Widely adopted for their scalability and flexibility, modern microservice systems present unique failure diagnosis challenges due to their independent deployment and dynamic interactions. This complexity can lead to cascading failures that degrade operational efficiency and user experience. Recognizing the critical role of fault diagnosis in improving the stability and reliability of microservice systems, researchers have conducted extensive studies and achieved a number of significant results. This survey provides an exhaustive review of 98 scientific papers from 2003 to the present, including a thorough examination of the fundamental concepts, system architecture, and problem statement. It also includes a qualitative analysis of the diagnostic dimensions, with an in-depth discussion of current best practices and future directions, aiming to advance the development and application of failure diagnosis techniques. In addition, this survey compiles publicly available datasets, toolkits, and evaluation metrics to help practitioners select and validate techniques.
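As an illustration of the kind of evaluation metric such comparisons rely on, the sketch below computes AC@k (the fraction of failure cases whose true root cause appears among the top k ranked candidates) and Avg@k (the mean of AC@1 through AC@k). These two metrics are widely used in the microservice root-cause localization literature, but this page does not list which metrics the survey actually covers, so treat this as an assumption. The function names and the service names in the example are hypothetical, and other definitions of Avg@k exist.

```python
from typing import Sequence


def ac_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """AC@k: share of cases whose ground-truth root cause is in the top-k candidates."""
    if not rankings or len(rankings) != len(truths):
        raise ValueError("need exactly one ranking per ground-truth label")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Count a hit when the true faulty service appears among the first k ranked entries.
    hits = sum(1 for ranking, truth in zip(rankings, truths) if truth in ranking[:k])
    return hits / len(rankings)


def avg_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Avg@k: mean of AC@1 .. AC@k (one common formulation; variants exist)."""
    return sum(ac_at_k(rankings, truths, j) for j in range(1, k + 1)) / k


if __name__ == "__main__":
    # Hypothetical ranked outputs from a root-cause localizer for three failure cases.
    rankings = [
        ["cart", "payment", "db"],
        ["db", "auth", "cart"],
        ["auth", "gateway", "db"],
    ]
    truths = ["payment", "db", "cart"]
    for k in (1, 2, 3):
        print(f"AC@{k} = {ac_at_k(rankings, truths, k):.3f}")
    print(f"Avg@3 = {avg_at_k(rankings, truths, 3):.3f}")
```

In this toy run, AC@1 is 1/3, AC@2 and AC@3 are both 2/3 (the third case's root cause is never ranked), and Avg@3 is 5/9. Reporting several values of k shows how many candidates an operator would need to inspect before finding the faulty service.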

Problem

Microservice Systems
Error Localization
Reliability

Innovation

Microservices
Fault Detection
Systematic Review