🤖 AI Summary
Slot Attention often binds slots to local object parts rather than whole objects in real-world images. To address this, the paper proposes GLASS, a text-guided generative slot learning framework. It reinterprets a pretrained diffusion decoder as a semantic mask generator and jointly optimizes object-level slot representations and image-text alignment within the diffusion reconstruction space. This enables weakly supervised semantic alignment and unifies segmentation, generation, and attribute prediction in a single multi-task model. The method requires no pixel-level annotations, relying solely on image-text pairing signals. On the PASCAL VOC and COCO object discovery benchmarks, it achieves relative mIoU improvements of approximately 35% and 10%, respectively, over the prior state-of-the-art method, and it sets a new best FID for conditional image generation among slot-attention-based approaches. Its weakly supervised segmentation performance also surpasses existing language-guided and dedicated weakly supervised models.
📝 Abstract
Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet these slots often bind to object parts rather than to the objects themselves, especially on real-world datasets. To address this, we introduce Guided Latent Slot Diffusion (GLASS), an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder model, which reconstructs the images from the slots, as a semantic mask generator based on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves relative mIoU improvements of approximately 35% and 10% over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation among slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models that were specifically designed for the task.
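To make the "relative improvement for mIoU" figures concrete, the following is a minimal sketch of how mean IoU over mask pairs and a relative gain over a baseline could be computed. It assumes each predicted mask has already been matched to its ground-truth mask, and it is not the paper's evaluation code. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks score 0.0.
    return inter / union if union else 0.0


def mean_iou(pairs):
    """Average IoU over already-matched (predicted, ground-truth) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.35 means +35%."""
    return (new - baseline) / baseline


# Illustrative values only; these are not results from the paper.
pairs = [([1, 1, 0, 0], [1, 0, 0, 0]), ([1, 1], [1, 1])]
score = mean_iou(pairs)
gain = relative_improvement(0.54, 0.40)
```

A relative improvement differs from an absolute one: in the made-up example above, moving from 0.40 to 0.54 mIoU is a gain of 14 points in absolute terms but roughly 35% in relative terms.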