🤖 AI Summary
Deep neural networks are prone to learning spurious correlations, leading to poor performance under distribution shift and on minority subgroups. To address this, the work proposes a bilevel meta-learning approach that performs semantic editing of support-set samples within the frozen feature space of a pretrained backbone and adapts a lightweight classification head via a few inner-loop updates, thereby steering the model toward semantic features genuinely predictive of the target label. The method achieves efficient and stable feature-space meta-augmentation without requiring end-to-end retraining. Experiments demonstrate that training for just a few minutes on a single GPU substantially improves worst-group accuracy, and CLIP-based visualizations confirm that the learned feature edits are semantically meaningful and aligned with spurious attributes rather than arbitrary perturbations.
📝 Abstract
Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to make the classifier head more robust to spurious correlations. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating on the backbone's output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.
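The bilevel structure described in the abstract can be sketched in code. The following is a minimal, hypothetical PyTorch illustration, not the paper's implementation: the backbone features are simulated with random tensors, the feature edit is assumed to be a simple additive per-sample delta, and the head is a bare linear classifier. The inner loop adapts the head on edited support features; the outer loop differentiates the query ("hard example") loss through those inner updates to train the edits, while the frozen backbone never appears in the computation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_support, n_query, n_classes = 16, 32, 32, 2

# Stand-ins for frozen-backbone features (assumption: real features
# would come from a pretrained, frozen feature extractor).
support_feats = torch.randn(n_support, d)
support_labels = torch.randint(0, n_classes, (n_support,))
query_feats = torch.randn(n_query, d)     # "hard" / worst-group examples
query_labels = torch.randint(0, n_classes, (n_query,))

# Learnable support-side feature edit (assumed additive parameterization).
delta = torch.zeros(n_support, d, requires_grad=True)
# Lightweight linear head, kept as explicit tensors so inner-loop
# updates stay differentiable for the outer loop.
W = torch.zeros(n_classes, d, requires_grad=True)
b = torch.zeros(n_classes, requires_grad=True)

outer_opt = torch.optim.Adam([delta], lr=1e-2)
inner_lr, inner_steps = 0.1, 3

for outer_step in range(50):
    W_i, b_i = W, b
    # Inner loop: a few gradient steps of the head on edited features.
    for _ in range(inner_steps):
        logits = (support_feats + delta) @ W_i.t() + b_i
        inner_loss = F.cross_entropy(logits, support_labels)
        gW, gb = torch.autograd.grad(inner_loss, (W_i, b_i),
                                     create_graph=True)  # keep graph for outer grad
        W_i, b_i = W_i - inner_lr * gW, b_i - inner_lr * gb
    # Outer loop: the adapted head's loss on hard examples trains delta.
    outer_loss = F.cross_entropy(query_feats @ W_i.t() + b_i, query_labels)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

Because `create_graph=True` keeps the inner-update graph alive, `outer_loss.backward()` propagates through all inner steps into `delta`, which is the defining trait of a bilevel (MAML-style) formulation; operating on small feature vectors instead of pixels is what keeps this cheap.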