🤖 AI Summary
Existing ML monitoring systems detect data distribution shifts but struggle to automatically identify their root causes -- such as data defects, software failures, or genuine concept drift -- relying instead on manual root-cause analysis. To address this, we propose a causal graph framework for ML systems, the first to integrate causal modeling with hierarchical system abstraction. It constructs a multi-granularity dependency graph spanning environmental variables, data pipelines, and model components. By analyzing propagation paths, the framework enables traceable attribution of distribution shifts back to their origins, systematically mapping observed distributional changes to their underlying causes and advancing ML monitoring from merely detecting *whether* a shift occurs to explaining *why* it occurs. The framework provides both theoretical foundations and a practical technical pathway toward interpretable, traceable, and automated ML monitoring.
📝 Abstract
Monitoring machine learning (ML) systems is hard, with standard practice focusing on detecting distribution shifts rather than their causes. Root-cause analysis often relies on manual tracing to determine whether a shift is caused by software faults, data-quality issues, or natural change. We propose ML System Maps -- causal maps that, through layered views, make explicit the propagation paths between the environment and the ML system's internals, enabling systematic attribution of distribution shifts. We outline the approach and a research agenda for its development and evaluation.
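The core idea -- a layered causal dependency graph whose propagation paths are traced upstream from a detected shift to candidate root causes -- can be sketched in a few lines. The node names, layer prefixes, and graph below are purely illustrative assumptions, not structures defined in the paper:

```python
from collections import defaultdict

# Hypothetical "ML System Map": a causal dependency graph whose nodes
# span three layers -- environment (env:), data pipeline (data:), and
# model (model:). Edges point in the direction a shift propagates.
EDGES = {
    "env:user_behavior":    ["data:click_logs"],
    "env:sensor_firmware":  ["data:telemetry"],
    "data:click_logs":      ["data:feature_table"],
    "data:telemetry":       ["data:feature_table"],
    "data:feature_table":   ["model:input_features"],
    "model:input_features": ["model:predictions"],
}

def parents_of(edges):
    """Invert the child lists so each node maps to its direct causes."""
    parents = defaultdict(list)
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].append(src)
    return parents

def trace_root_causes(shifted_node, edges):
    """Walk propagation paths upstream from a node where a distribution
    shift was detected; ancestors with no parents of their own are the
    candidate root causes (e.g. an environment change, as opposed to a
    defect introduced in an intermediate pipeline stage)."""
    parents = parents_of(edges)
    seen, stack, roots = set(), [shifted_node], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        ups = parents.get(node, [])
        if not ups and node != shifted_node:
            roots.append(node)
        stack.extend(ups)
    return sorted(roots)

print(trace_root_causes("model:predictions", EDGES))
# -> ['env:sensor_firmware', 'env:user_behavior']
```

In this toy graph, a shift observed at `model:predictions` is attributed to both environment-layer ancestors; a real system map would additionally attach shift detectors to intermediate nodes so that attribution can distinguish, say, a pipeline defect at `data:feature_table` from genuine upstream change.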