🤖 AI Summary
To address the poor robustness of image recognition models caused by over-reliance on background cues, this paper proposes a “segment-then-recognize” decoupling paradigm. Specifically, it uses a zero-shot segmentation model to explicitly separate foreground and background, models their features independently, and fuses them through a learnable weighting mechanism for joint inference. This work is the first to integrate zero-shot segmentation into the recognition pipeline, suppressing background bias while preserving contextual information, which improves both robustness and interpretability. On standard benchmarks, the method achieves state-of-the-art in-distribution accuracy and generalizes significantly better to out-of-distribution scenarios, including natural adversarial perturbations and background domain shifts. These results support the effectiveness and feasibility of foreground-background segmentation as a step before robust recognition.
📝 Abstract
In image recognition, both the foreground (FG) and the background (BG) play an important role; however, standard deep image recognition models often over-rely on the BG unintentionally, which limits their robustness in real-world deployment settings. Current solutions mainly suppress the BG, sacrificing BG information for improved generalization. We propose "Segment to Recognize Robustly" (S2R^2), a novel recognition approach that decouples FG and BG modelling and combines them in a simple, robust, and interpretable manner. S2R^2 leverages recent advances in zero-shot segmentation to isolate the FG and the BG before or during recognition. By combining FG and BG, potentially also with a standard full-image classifier, S2R^2 achieves state-of-the-art results on in-domain data while maintaining robustness to BG shifts. The results confirm that segmentation before recognition is now possible.
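The decoupling described above can be illustrated with a small sketch. The abstract does not specify how the FG, BG, and full-image predictions are combined, so the following assumes one simple option: a softmax-normalized weighted sum of per-branch class logits, where the raw weights stand in for learnable parameters. The function names (`split_by_mask`, `fuse_logits`, `predict`) and the toy logits are hypothetical and are not taken from the paper; the binary mask is assumed to come from some zero-shot segmentation model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_mask(image, mask):
    """Split a 2D image into FG and BG copies using a binary mask.

    Pixels outside the respective region are set to 0. The mask is
    assumed to be produced by a segmentation model (not shown here).
    """
    fg = [[p if m else 0 for p, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    bg = [[0 if m else p for p, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    return fg, bg


def fuse_logits(branch_logits, raw_weights):
    """Combine per-branch class logits with softmax-normalized weights.

    branch_logits: one list of class logits per branch (FG, BG, ...).
    raw_weights: one unnormalized weight per branch; in a trained system
    these would be learned (an assumption of this sketch).
    """
    weights = softmax(raw_weights)
    n_classes = len(branch_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, branch_logits))
        for c in range(n_classes)
    ]


def predict(branch_logits, raw_weights):
    """Return the index of the class with the highest fused logit."""
    fused = fuse_logits(branch_logits, raw_weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy example: the FG branch favours class 0, the BG branch favours class 1.
fg_logits = [2.0, 0.0]
bg_logits = [0.0, 3.0]
equal_weighting = predict([fg_logits, bg_logits], [0.0, 0.0])
fg_heavy_weighting = predict([fg_logits, bg_logits], [2.0, 0.0])
```

With equal weights the misleading BG branch dominates in this toy case, while shifting weight toward the FG branch flips the decision. Because each branch contributes a separate, inspectable term, a combination of this kind stays interpretable.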