🤖 AI Summary
Deep learning models often suffer degraded generalization robustness due to spurious correlations between the target and non-causal features (e.g., backgrounds, local textures). Existing debiasing methods rely on costly, coarse-grained group annotations that indicate spurious correlations. This paper proposes a post-hoc, annotation-free framework for mitigating spurious bias. The framework first identifies prediction shortcuts, which reflect potentially non-robust prediction behavior, in a given model's latent space; it then retrains the model with an invariance objective that suppresses reliance on the identified shortcuts. Unlike prior approaches, the method localizes non-robust prediction patterns at a fine granularity without requiring explicit group labels. Experiments across multiple benchmark datasets demonstrate substantial improvements in robustness under distribution shift, with zero annotation cost, high computational efficiency, and strong practical applicability.
📝 Abstract
Deep learning models often achieve high performance by inadvertently learning spurious correlations between targets and non-essential features. For example, an image classifier may identify an object via its background, which spuriously correlates with it. This prediction behavior, known as spurious bias, severely degrades model performance on data that lacks the learned spurious correlations. Existing methods for spurious bias mitigation typically require a variety of data groups annotated with spurious correlations, known as group labels. However, group labels require costly human annotation and often fail to capture subtle spurious biases, such as reliance on specific pixels for predictions. In this paper, we propose a novel post hoc spurious bias mitigation framework that does not require group labels. Our framework, termed ShortcutProbe, identifies prediction shortcuts in a given model's latent space that reflect potentially non-robust prediction behavior. The model is then retrained to be invariant to the identified prediction shortcuts for improved robustness. We theoretically analyze the effectiveness of the framework and empirically demonstrate that it is an efficient and practical tool for improving a model's robustness to spurious bias on diverse datasets.