🤖 AI Summary
A significant gap exists between academic research and industrial practice in debugging machine learning (ML) systems. Method: We propose the first comprehensive, lifecycle-spanning taxonomy of ML debugging faults and corresponding mitigation methods, derived from a systematic literature review (SLR), analysis of previously released interviews with 28 ML practitioners, and empirical analysis of 1,247 GitHub issues. Contribution/Results: Our study identifies 13 core debugging challenges; only 48% are addressed by existing academic work, while 52.6% of GitHub issues and 70.3% of interview-elicited problems lack corresponding methodological support. Critically, we quantitatively show that over half of real-world ML debugging difficulties remain unaddressed by current research, revealing a substantial knowledge gap. This work establishes a foundational classification framework, provides empirical evidence of methodological coverage gaps, and delivers a prioritized roadmap to bridge the theory-practice divide in ML debugging.
📝 Abstract
Debugging ML software (i.e., the detection, localization, and fixing of faults) poses unique challenges compared to traditional software, largely due to the probabilistic nature and heterogeneity of its development process. Various methods have been proposed for testing, diagnosing, and repairing ML systems. However, the big picture that would inform research directions addressing the actual needs of developers has yet to unfold, leaving several key questions open: (1) Which faults targeted by ML debugging research fulfill developers' needs in practice? (2) How are these faults addressed? (3) What are the challenges in addressing the yet-untargeted faults? In this paper, we conduct a systematic study of debugging techniques for machine learning systems. We first collect technical papers focusing on debugging components in machine learning software. We then map these papers to a taxonomy of faults to assess the current state of fault resolution in the existing literature. Next, we analyze which techniques address specific faults, based on the collected papers, yielding a comprehensive taxonomy that aligns faults with their corresponding debugging methods. Finally, we examine previously released transcripts of developer interviews to identify the challenges in resolving unfixed faults. Our analysis reveals that only 48% of the identified ML debugging challenges have been explicitly addressed by researchers, while 46.9% remain unresolved or unmentioned. In real-world applications, we find that 52.6% of issues reported on GitHub and 70.3% of problems discussed in interviews remain unaddressed by ML debugging research. The study identifies 13 primary challenges in ML debugging, highlighting a significant gap between the identification of ML debugging issues and their resolution in practice.