🤖 AI Summary
When machine learning models are deployed across heterogeneous environments, performance degradation is often uneven across subgroups. Existing methods either explain only average-level performance shifts or identify vulnerable subgroups without explaining the cause, so they do not jointly determine *where* degradation occurs and *why* it arises. This paper introduces SHIFT, a hierarchical inference framework that unifies subgroup scanning, hierarchical causal inference, variable subset sensitivity analysis, and interpretable shift attribution. SHIFT identifies degraded subgroups and disentangles the underlying causes, distinguishing covariate shift from outcome shift. In real-world experiments, SHIFT produces human-interpretable attributions of performance degradation and suggests targeted interventions that mitigate the decay for affected subgroups while limiting unintended effects on others.
📝 Abstract
Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks "Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?" (Where?) and, if so, dives deeper to ask "Can we explain this using more detailed variable(subset)-specific shifts?" (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.
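To make the "Where?" question concrete, the following is a minimal illustrative sketch, not the SHIFT procedure itself. It assumes subgroups are given in advance as labels and that per-example losses are available in both the source and target settings; the function name `flag_decayed_subgroups` and the `tolerance` parameter are hypothetical. It only compares mean losses per subgroup and does not perform the statistical testing or the covariate/outcome shift decomposition described in the abstract.

```python
import numpy as np


def flag_decayed_subgroups(loss_src, groups_src, loss_tgt, groups_tgt, tolerance=0.05):
    """Naive subgroup comparison (an assumption for illustration, not the paper's method).

    Returns a dict mapping each subgroup label to its increase in mean loss
    (target minus source), keeping only subgroups whose increase exceeds
    `tolerance`. Subgroups missing from either setting are skipped.
    """
    loss_src = np.asarray(loss_src, dtype=float)
    loss_tgt = np.asarray(loss_tgt, dtype=float)
    groups_src = np.asarray(groups_src)
    groups_tgt = np.asarray(groups_tgt)

    flagged = {}
    # Only subgroups observed in both settings can be compared.
    for g in np.intersect1d(groups_src, groups_tgt):
        src_mean = loss_src[groups_src == g].mean()
        tgt_mean = loss_tgt[groups_tgt == g].mean()
        decay = tgt_mean - src_mean
        if decay > tolerance:
            # .item() converts the NumPy scalar label to a plain Python value.
            flagged[g.item()] = float(decay)
    return flagged
```

A subgroup flagged this way would only be a candidate for the deeper "How?" analysis. A raw difference in means cannot by itself say whether the decay comes from covariate shift or outcome shift.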