AI Summary
This study investigates the practical implementation and evolutionary trajectory of Chaos Engineering (CE) within DevOps practices to enhance the resilience of distributed systems in dynamic production environments.
Method: A systematic gray literature review of 50 industry sources published between 2019 and early 2024 identifies application patterns and implementation mechanisms.
Contribution/Results: The study proposes a novel ten-concept classification framework that extends beyond traditional CE principles, emphasizing controlled experimentation, automated execution, and integrated risk mitigation strategies tailored for agile and DevOps contexts. It reveals a paradigm shift from ad hoc fault injection toward continuous, pipeline-embedded resilience validation within CI/CD workflows. The framework provides practitioners with a reusable, context-aware implementation guide and advances resilience engineering theory by grounding it in empirical industrial evidence, thereby informing future theoretical development and empirical research.
Abstract
Chaos Engineering (CE) has emerged as a proactive method to improve the resilience of modern distributed systems, particularly within DevOps environments. Originally pioneered by Netflix, CE simulates real-world failures to expose weaknesses before they impact production. In this paper, we present a systematic gray literature review that investigates how industry practitioners have adopted and adapted CE principles over recent years. Analyzing 50 sources published between 2019 and early 2024, we developed a comprehensive classification framework that extends the foundational CE principles into ten distinct concepts. Our study reveals that while the core tenets of CE remain influential, practitioners increasingly emphasize controlled experimentation, automation, and risk mitigation strategies to align with the demands of agile and continuously evolving DevOps pipelines. Our results enhance the understanding of how CE is intended and implemented in practice, and offer guidance for future research and industrial applications aimed at improving system robustness in dynamic production environments.