AI Summary
The layer-wise distribution of shortcuts (spurious correlations) in deep networks remains poorly understood, hindering principled mitigation strategies. To address this, we propose counterfactual inter-layer attribution to quantify each layer's contribution to generalization degradation under clean versus biased data, conducting systematic analysis across VGG, ResNet, DeiT, and ConvNeXt on CIFAR-10, Waterbirds, and CelebA. We discover a cross-layer collaborative shortcut learning pattern: shallow layers predominantly encode spurious features, while deeper layers selectively forget core discriminative features from clean data. Leveraging this insight, we construct multi-dimensional perturbation axes for precise shortcut localization. Experiments reveal that shortcut effects permeate the entire network and depend strongly on both dataset and architecture, rendering generic mitigation strategies ineffective; customized, architecture- and task-aware interventions are thus essential. This work establishes a novel paradigm for mechanistic modeling of shortcuts and targeted intervention.
Abstract
Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies each layer's contribution to the accuracy degradation caused by a shortcut-inducing skew, via counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on the CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze differences in localization and describe their principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests that general methods are hard to design, supporting dataset- and architecture-specific approaches instead.
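To make the counterfactual layer-swap idea concrete, here is a minimal numpy sketch. It assumes two ReLU-MLP weight lists, one trained on clean data and one on skewed data, and attributes a layer's role by replacing that single layer in the clean model with its skewed counterpart and measuring the accuracy drop on clean data. The function names, the toy MLP, and the swap-one-layer protocol are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def accuracy(layers, X, y):
    """Forward pass through a bias-free ReLU MLP given as a list of
    weight matrices; returns top-1 accuracy on (X, y)."""
    h = X
    for W in layers[:-1]:
        h = np.maximum(h @ W, 0.0)  # hidden layers with ReLU
    logits = h @ layers[-1]         # linear output layer
    return float((logits.argmax(axis=1) == y).mean())

def layerwise_attribution(clean_layers, skewed_layers, X, y):
    """Counterfactual swap (illustrative): replace one clean layer at a
    time with its skewed-trained counterpart and record the resulting
    accuracy drop on clean evaluation data."""
    base = accuracy(clean_layers, X, y)
    drops = []
    for i in range(len(clean_layers)):
        hybrid = list(clean_layers)
        hybrid[i] = skewed_layers[i]
        drops.append(base - accuracy(hybrid, X, y))
    return drops

# Toy 2-layer example: the clean model classifies perfectly; the skewed
# model differs only in its output layer, which swaps the two classes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
y = np.array([0, 1, 0, 1])
clean = [np.eye(2), np.eye(2)]
skewed = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
drops = layerwise_attribution(clean, skewed, X, y)
```

In this toy setup only the swapped output layer degrades clean accuracy, so `drops` is `[0.0, 1.0]`; in real networks the paper's finding is that such drops are spread across many layers rather than concentrated in one.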