🤖 AI Summary
This work addresses the vulnerability of semantic segmentation models to spurious correlations under distribution shift, which can lead to semantic label flipping: regions whose boundaries are largely correct are assigned the wrong foreground class. The study characterizes this failure mode by decomposing ground-truth foreground pixels into three categories: correctly labeled, flipped to another foreground class, and missed as background. It introduces flip-risk, an entropy-based inference-time diagnostic that estimates label-flip susceptibility from foreground identity uncertainty, without requiring ground-truth annotations. Experiments show that stronger correlations between object category and scene context during training widen the gap between common and rare test conditions and increase label-flip errors on counterfactual test groups. The proposed flip-risk score can flag flip-prone cases, offering a practical tool for evaluating segmentation robustness under distribution shift beyond overlap metrics.
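The three-way decomposition can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `decompose_foreground`, the convention that class index 0 is background, and the choice to report each category as a fraction of ground-truth foreground pixels are all assumptions made here.

```python
import numpy as np


def decompose_foreground(gt, pred, background=0):
    """Split ground-truth foreground pixels into correct / flip / miss fractions.

    Hypothetical helper: `background=0` is an assumed label convention.
    """
    gt = np.asarray(gt)
    pred = np.asarray(pred)
    fg = gt != background                      # ground-truth foreground pixels
    n_fg = int(fg.sum())
    if n_fg == 0:                              # no foreground: nothing to decompose
        return {"correct": 0.0, "flip": 0.0, "miss": 0.0}

    correct = fg & (pred == gt)                # right foreground identity
    miss = fg & (pred == background)           # foreground predicted as background
    flip = fg & (pred != gt) & (pred != background)  # wrong foreground identity

    return {
        "correct": float(correct.sum()) / n_fg,
        "flip": float(flip.sum()) / n_fg,
        "miss": float(miss.sum()) / n_fg,
    }


# Toy example: four foreground pixels of class 1.
gt = [[0, 1, 1],
      [0, 1, 1]]
pred = [[0, 1, 2],
        [0, 2, 0]]
print(decompose_foreground(gt, pred))
# {'correct': 0.25, 'flip': 0.5, 'miss': 0.25}
```

The three fractions sum to one by construction, so a prediction that keeps most foreground pixels as foreground but labels them with the wrong class shows up as a large `flip` share rather than a large `miss` share.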
📝 Abstract
The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground-truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based 'flip-risk' score that requires no ground-truth labels, is computed from foreground identity uncertainty, and can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.
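One way an entropy-based, label-free score of this kind could be computed is sketched below. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function name `flip_risk`, class index 0 as background, renormalizing probabilities over foreground classes only, normalizing entropy to [0, 1], and averaging over pixels predicted as foreground. The released code may define the score differently.

```python
import numpy as np


def flip_risk(probs, background=0, eps=1e-12):
    """Mean normalized entropy over foreground identities (assumed definition).

    probs: array of shape (C, H, W) holding per-pixel softmax probabilities.
    """
    probs = np.asarray(probs, dtype=float)
    n_fg_classes = probs.shape[0] - 1
    if n_fg_classes < 2:
        raise ValueError("need at least two foreground classes")

    pred = probs.argmax(axis=0)
    fg_mask = pred != background               # pixels predicted as foreground
    if not fg_mask.any():
        return 0.0

    # Drop the background channel and renormalize, so the entropy reflects
    # uncertainty about WHICH foreground class, not foreground vs. background.
    fg_probs = np.delete(probs, background, axis=0)
    fg_probs = fg_probs / (fg_probs.sum(axis=0, keepdims=True) + eps)

    entropy = -(fg_probs * np.log(fg_probs + eps)).sum(axis=0)
    entropy /= np.log(n_fg_classes)            # scale to roughly [0, 1]
    return float(entropy[fg_mask].mean())


# Two single-pixel "images" with one background and two foreground classes.
uncertain = np.array([0.10, 0.45, 0.45]).reshape(3, 1, 1)
confident = np.array([0.10, 0.89, 0.01]).reshape(3, 1, 1)
print(flip_risk(uncertain) > flip_risk(confident))  # True
```

Under these assumptions, a pixel that is clearly foreground but split evenly between two foreground classes scores close to 1, while a pixel with one dominant foreground class scores close to 0, and no ground-truth mask is needed at any point.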