🤖 AI Summary
Black-box classifiers often generalize poorly, especially under out-of-distribution settings, because they rely on spurious correlations. To address this, we propose a counterfactual (CF) alignment method that generates CF images perturbed with respect to a target classifier and evaluates the consistency of the output responses they induce across multiple heterogeneous classifiers. This enables unsupervised, model-agnostic localization and quantification of individual spurious-correlation instances. Crucially, we introduce cross-model CF response consistency as a novel, principled criterion for identifying spurious correlations, a first in the literature, and demonstrate its utility for evaluating robustness interventions such as GroupDRO, JTT, and FLAC. We validate our approach on face-attribute and waterbird benchmarks, achieving high detection accuracy. Visualizations and quantitative metrics show strong agreement, confirming that the method reliably identifies spurious correlations and accurately measures their strength.
📝 Abstract
Models driven by spurious correlations often generalize poorly. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations in black-box classifiers. The method generates counterfactual images with respect to one classifier and feeds them into other classifiers to test whether they also induce changes in those classifiers' outputs. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. We validate the approach by observing intuitive trends in face-attribute and waterbird classifiers, and by fabricating spurious correlations and detecting their presence both visually and quantitatively. Furthermore, using CF alignment, we show that robust optimization methods (GroupDRO, JTT, and FLAC) can be evaluated by detecting a reduction in spurious correlations.
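The quantification described above can be sketched as a correlation between the output changes that the counterfactual images induce in the base classifier and in a second classifier. This is a minimal illustrative sketch, not the paper's implementation: the scalar-score classifier interface, the function name `cf_alignment`, and the use of Pearson correlation are assumptions for demonstration.

```python
import numpy as np

def cf_alignment(base_clf, other_clf, images, cf_images):
    """Correlate the output changes that counterfactual (CF) images
    induce in a base classifier with those induced in another classifier.

    base_clf / other_clf: callables mapping a batch of images to scalar
    scores (hypothetical interface; any black-box predictor would work).
    A strong positive correlation suggests the other classifier responds
    to the same features the CF perturbation changed, i.e. a shared and
    possibly spurious reliance on those features.
    """
    delta_base = base_clf(cf_images) - base_clf(images)
    delta_other = other_clf(cf_images) - other_clf(images)
    return np.corrcoef(delta_base, delta_other)[0, 1]
```

In this sketch, an alignment near 1 across many heterogeneous classifiers would flag a candidate spurious correlation, while scores near 0 suggest the perturbed feature is specific to the base classifier.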