🤖 AI Summary
Systematic learning from software failures remains underdeveloped in high-reliability organizations (HROs), and failures recur as a result. Method: This study uses a qualitative single-case design: in-depth interviews with 10 research software engineers at a national space research center, supplemented by 5 additional interviews at other HROs to assess transferability. Thematic analysis reveals a pervasive absence of structured processes for collecting, documenting, sharing, and applying failure knowledge. Contribution/Results: The key barriers identified are time constraints, knowledge loss from personnel turnover and fragmented documentation, and weak process enforcement. The study proposes a three-dimensional “Process–Knowledge–Organization” improvement framework that advocates (1) embedding learning mechanisms directly into development workflows, (2) establishing a searchable, semantically enriched failure knowledge repository, and (3) strengthening cross-functional collaboration and executive sponsorship. The framework is intended to provide actionable, transferable practices for improving reliability in safety-critical software systems.
📝 Abstract
Software failures can have significant consequences, making learning from failures a critical aspect of software engineering. Although software organizations are encouraged to conduct postmortems, the adoption and effectiveness of these practices vary widely. Understanding how engineers gather, document, share, and apply lessons from failures is essential for improving reliability and preventing recurrence. High-reliability organizations (HROs) often develop software systems in which failures carry catastrophic risks, so they depend on continuous learning to ensure reliability. These organizations therefore provide a valuable setting in which to examine the practices and challenges of learning from software failures, and such insight could guide the development of supporting processes and tools. However, in-depth industry perspectives on these practices and challenges are lacking.
To address this gap, we conducted a case study consisting of 10 in-depth interviews with research software engineers at a national space research center. We examine how they learn from failures: how they gather, document, share, and apply lessons. To assess transferability, we include data from 5 additional interviews at other HROs. Our findings provide insight into how engineers learn from failures in practice. In summary: (1) failure learning is informal, ad hoc, and inconsistently integrated into the software development life cycle (SDLC); (2) recurring failures persist due to the absence of structured processes; and (3) key challenges, including time constraints, knowledge loss from turnover and fragmented documentation, and weak process enforcement, undermine systematic learning. Our findings deepen the understanding of how software engineers learn from failures and offer guidance for improving failure management practices.