🤖 AI Summary
This work identifies object scale (the ROI-to-image ratio) and spatial location (eccentricity) as primary spatial biases that induce spurious background correlations in vision models. To investigate this systematically, the authors introduce Hard-Spurious-ImageNet—the first synthetic benchmark enabling controlled disentanglement of scale, position, and background—and use it to rigorously evaluate the robustness of mainstream ImageNet models (e.g., ResNet, ViT). Experiments reveal severe degradation (a >40% drop in worst-group accuracy) when objects are both small and highly eccentric. Critically, existing bias-mitigation methods (e.g., IRM, GroupDRO) improve worst-group accuracy by less than 2% under these conditions, exposing their failure to account for spatial structural biases. The study is the first to establish spatial dimensions as a core causal factor behind spurious correlations, and its interpretable, controllable synthetic diagnostic framework provides a new benchmark and analytical paradigm for robust visual representation learning.
📝 Abstract
Backgrounds in images are a major source of spurious correlations across data points. Owing to the aesthetic preferences of the humans capturing the images, datasets can exhibit positional biases (the location of the object within the frame) and size biases (the region-of-interest to image ratio) that differ across classes. In this paper, we show that these biases affect how heavily a model relies on spurious background features to make its predictions. To illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with varied backgrounds, object positions, and object sizes. By evaluating different pretrained models on this dataset, we find that most models rely heavily on spurious background features when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we show that current methods that aim to mitigate harmful spurious features do not take these factors into account, and hence fail to achieve considerable gains in worst-group accuracy when the size and location of core features in an image change.
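The paper does not publish its exact construction pipeline here, but a composite with a controlled ROI-to-image ratio and eccentricity can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `composite`, the use of nearest-neighbor resizing, and the convention that eccentricity 0 means centered and 1 means touching the image border are all assumptions for the sketch.

```python
import numpy as np

def composite(background, obj, roi_ratio, eccentricity):
    """Paste `obj` onto `background` so the object occupies `roi_ratio`
    of the image area, offset diagonally from the center by
    `eccentricity` (0 = centered, 1 = patch touches the image border).
    Nearest-neighbor resizing keeps the sketch dependency-free."""
    H, W = background.shape[:2]
    # side length of a square patch covering roi_ratio of the image area
    side = max(1, int(np.sqrt(roi_ratio * H * W)))
    # nearest-neighbor resize of the object to side x side
    ys = np.arange(side) * obj.shape[0] // side
    xs = np.arange(side) * obj.shape[1] // side
    patch = obj[ys][:, xs]
    # shift the patch center; (H - side) // 2 is the largest offset
    # that keeps the patch fully inside the frame
    cy = H // 2 + int(eccentricity * ((H - side) // 2))
    cx = W // 2 + int(eccentricity * ((W - side) // 2))
    y0, x0 = cy - side // 2, cx - side // 2
    out = background.copy()
    out[y0:y0 + side, x0:x0 + side] = patch
    return out
```

Sweeping `roi_ratio` toward small values and `eccentricity` toward 1 while swapping backgrounds yields exactly the hard cases the abstract describes: a small, off-center object on a potentially misleading background.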