Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing slot-based unsupervised scene decomposition methods process foreground and background indiscriminately, which causes background interference and limits instance discovery on real-world data. To address this, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background. In Stage I, a dual-slot competition mechanism with clustering-based slot initialization coarsely separates foreground from background. In Stage II, masked slot attention assigns the first slot to the background while the remaining slots compete for individual foreground objects, and pseudo-mask guidance suppresses over-segmentation of those objects. The pseudo-masks are derived from a patch affinity graph built on self-supervised image features, so no manual annotations are required. Experiments on both synthetic and real-world datasets show that FASA consistently outperforms state-of-the-art methods, supporting the effectiveness of explicit foreground modeling and pseudo-mask guidance for unsupervised scene decomposition.

📝 Abstract
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
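
The second-stage mechanism described above can be illustrated with a minimal NumPy sketch of a single masked attention step. This is not the authors' implementation: the masking rule (slot 0 restricted to background patches, the other slots restricted to foreground patches), the function name `masked_slot_attention_step`, and the omission of learned projections and the recurrent slot update are all assumptions made for illustration. The abstract only states that the first slot captures the background while the remaining slots compete for foreground objects.

```python
import numpy as np

def masked_slot_attention_step(feats, slots, fg_mask):
    """One simplified attention step (hypothetical sketch, not the paper's code).

    feats:   (N, D) patch features
    slots:   (K, D) slot vectors, K >= 2; slot 0 is treated as the background slot
    fg_mask: (N,) boolean coarse foreground mask, e.g. from a first stage
    """
    d = feats.shape[1]
    logits = slots @ feats.T / np.sqrt(d)            # (K, N) dot-product scores

    # Assumed masking rule: background slot sees only background patches,
    # foreground slots see only foreground patches.
    logits[0, fg_mask] = -np.inf
    logits[1:, ~fg_mask] = -np.inf

    # Softmax over the slot axis, so slots compete for each patch.
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)    # columns sum to 1

    # Each slot becomes the attention-weighted mean of the patches it claims.
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    new_slots = weights @ feats                      # (K, D)
    return new_slots, attn
```

In this sketch every background patch is assigned entirely to slot 0, and competition only takes place among the foreground slots on foreground patches. A full slot attention module would iterate this step and refine the slots with learned query, key, and value projections.
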
Problem

Research questions and friction points this paper is trying to address.

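Existing slot attention methods process foreground and background indiscriminately in unsupervised scene decomposition
Background interference degrades object discovery, particularly on real-world data
Foreground objects are over-segmented, which pseudo-mask guidance from self-supervised image features is meant to curb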
Innovation

Methods, ideas, or system contributions that make the work stand out.

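Two-stage framework explicitly separates foreground from background, using dual-slot competition with clustering-based slot initialization in the first stage
Masked slot attention reserves the first slot for the background while the remaining slots compete to represent individual foreground objects
Pseudo-mask guidance, derived from a patch affinity graph built on self-supervised image features, reduces over-segmentation of foreground objects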
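
As a rough illustration of how pseudo-masks might be derived from a patch affinity graph, the sketch below thresholds cosine similarities between patch features and groups foreground patches by connected components. The grouping procedure, the threshold `tau`, and the function name `pseudo_masks_from_affinity` are assumptions for illustration only. The summary does not specify how the paper turns the affinity graph into masks, and this sketch ignores the spatial layout of the patches.

```python
import numpy as np

def pseudo_masks_from_affinity(feats, fg_mask, tau=0.5):
    """Group foreground patches into pseudo-instances (hypothetical sketch).

    feats:   (N, D) self-supervised patch features
    fg_mask: (N,) boolean foreground mask
    tau:     assumed cosine-similarity threshold for connecting two patches
    Returns an (N,) integer label array; -1 marks background patches.
    """
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    affinity = normed @ normed.T                      # (N, N) cosine similarity

    # Keep only edges between two foreground patches that are similar enough.
    adj = (affinity > tau) & fg_mask[:, None] & fg_mask[None, :]

    labels = np.full(len(feats), -1)
    next_label = 0
    for seed in np.flatnonzero(fg_mask):
        if labels[seed] != -1:
            continue                                  # already assigned to a group
        labels[seed] = next_label
        stack = [seed]
        while stack:                                  # depth-first flood fill
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] == -1:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels
```

Each resulting label could serve as a target region for one foreground slot, discouraging a single object from being split across several slots.
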