🤖 AI Summary
This study addresses the lack of empirical understanding regarding the real-world adoption and evolution of chaos engineering tools in open-source ecosystems. We conduct the first large-scale empirical investigation, analyzing code, metadata, and temporal patterns across 971 GitHub repositories that use 10 mainstream chaos engineering tools. Our method integrates quantitative repository analysis with layered fault-injection categorization (network, instance, and application layers) and mapping of each repository to its purpose (development, teaching, learning, or research). Results reveal that Toxiproxy and Chaos Mesh are the most frequently used tools, reflecting growing adoption in cloud-native development; fault injection is heavily concentrated at the network (40.9%) and instance (32.7%) layers, while application-layer injection remains underrepresented (3.0%); and 58.0% of repositories use chaos engineering for software development, ahead of teaching, learning, and research. This work narrows the knowledge gap on practical chaos engineering deployment, identifies application-layer fault modeling as a notable blind spot, and provides evidence-based guidance for tool design, practice standardization, and test strategy enhancement.
📝 Abstract
Chaos engineering aims to improve the resilience of software systems by intentionally injecting faults to identify and address system weaknesses that cause outages in production environments. Although many tools for chaos engineering exist, their practical adoption has not yet been explored. This study examines 971 GitHub repositories that incorporate 10 popular chaos engineering tools to identify patterns and trends in their use. The analysis reveals that Toxiproxy and Chaos Mesh are the most frequently used, showing consistent growth since 2016 and reflecting increasing adoption in cloud-native development. The release of new chaos engineering tools peaked in 2018, followed by a shift toward refinement and integration, with Chaos Mesh and LitmusChaos leading in ongoing development activity. Software development is the most frequent application (58.0%), followed by unclassified purposes (16.2%), teaching (10.3%), learning (9.9%), and research (5.7%). Development-focused repositories tend to have higher activity, particularly for Toxiproxy and Chaos Mesh, highlighting their industrial relevance. Fault injection scenarios mainly address network disruptions (40.9%) and instance termination (32.7%), while application-level faults remain underrepresented (3.0%), highlighting an area for future exploration.