🤖 AI Summary
Machine learning models often fail to generalize to minority groups due to spurious correlations in training data. This work identifies the root cause as "noisy memorization" of spurious features by a small subset of neurons, inducing inter-group performance disparity. We present the first empirical evidence that specific neurons selectively encode minority-group information and drive generalization failure under spurious associations, motivating the novel *spurious memory localization* hypothesis. Guided by this, we design an intervention framework that actively suppresses such memorization during training. Our approach integrates neuron- and channel-level attribution analysis, spurious-memory quantification, targeted pruning, and regularization. Evaluated on ResNet and ViT architectures across the Waterbirds and CelebA benchmarks, it improves minority-group accuracy by 12.3–18.7% while preserving majority-group performance. These results demonstrate that suppressing spurious memory effectively decouples model robustness from majority-group bias.
📝 Abstract
Machine learning models often rely on simple spurious features -- patterns in training data that correlate with targets but are not causally related to them, such as image backgrounds in foreground classification. This reliance typically leads to imbalanced test performance across minority and majority groups. In this work, we take a closer look at the fundamental cause of such imbalanced performance through the lens of memorization: the ability to predict accurately on *atypical* examples (minority groups) in the training set while failing to achieve the same accuracy on the test set. We systematically show that spurious features are ubiquitously encoded in a small set of neurons within the network, providing the first evidence that memorization may contribute to imbalanced group performance. Through three converging sources of empirical evidence, we find that a small subset of neurons or channels memorizes minority-group information. Inspired by these findings, we articulate the hypothesis that imbalanced group performance is a byproduct of "noisy" spurious memorization confined to a small set of neurons. To further substantiate this hypothesis, we show that eliminating these unnecessary spurious memorization patterns via a novel framework during training can significantly improve model performance on minority groups. Our experimental results across various architectures and benchmarks offer new insights into how neural networks encode core and spurious knowledge, laying the groundwork for future research in demystifying robustness to spurious correlation.
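The localization idea above -- that a small set of neurons carries minority-group (spurious) memory -- can be probed with a simple ablation test: zero out one hidden unit at a time and measure the drop in minority-group accuracy. The sketch below is a hypothetical illustration on a toy NumPy network; the paper's actual attribution metric, architectures (ResNet/ViT), and pruning criterion are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-hidden-layer network: x -> ReLU(x @ W1) -> @ W2 -> logits.
# Stands in for a trained model; weights are random for illustration only.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def accuracy(x, y, mask):
    """Accuracy with hidden units zeroed wherever mask == 0 (neuron ablation)."""
    h = np.maximum(x @ W1, 0.0) * mask
    return float((np.argmax(h @ W2, axis=1) == y).mean())

def spurious_memory_scores(x_min, y_min, n_hidden=8):
    """Per-neuron score: accuracy drop on the minority group when that single
    neuron is ablated. Large drops flag candidate 'spurious memory' neurons
    (a hypothetical criterion, not the paper's exact quantification)."""
    base = accuracy(x_min, y_min, np.ones(n_hidden))
    scores = []
    for j in range(n_hidden):
        mask = np.ones(n_hidden)
        mask[j] = 0.0
        scores.append(base - accuracy(x_min, y_min, mask))
    return np.array(scores)

# Synthetic stand-in for a minority-group evaluation batch.
x_min = rng.normal(size=(32, 4))
y_min = rng.integers(0, 2, size=32)
scores = spurious_memory_scores(x_min, y_min)
print(scores.shape)  # one ablation score per hidden neuron: (8,)
```

In a real pipeline, neurons with outsized minority-group scores would then be candidates for the targeted pruning or regularization step described above, applied during training rather than post hoc.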