Guided Latent Slot Diffusion for Object-Centric Learning

📅 2024-07-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Slot Attention often binds slots to local object parts rather than whole objects in real-world images. To address this, the paper proposes GLASS, a generative slot learning framework guided by generated captions. It learns the slot-attention module in the space of generated images, which lets the pretrained diffusion decoder that reconstructs images from the slots be repurposed as a semantic mask generator driven by those captions. The resulting object-level representation supports segmentation, image generation, and property prediction in a single model. The method requires no pixel-level annotations; its guidance signal comes from the captions. On the PASCAL VOC and COCO object discovery benchmarks, GLASS achieves relative mIoU improvements of approximately 35% and 10% over the previous state of the art, respectively, and sets a new state-of-the-art FID for conditional image generation among slot-attention-based methods. On segmentation, it surpasses state-of-the-art weakly supervised and language-based models that were designed specifically for that task.

📝 Abstract
Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet, these slots often bind to object parts, not objects themselves, especially for real-world datasets. To address this, we introduce Guided Latent Slot Diffusion (GLASS), an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder model, which reconstructs the images from the slots, as a semantic mask generator based on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves relative mIoU improvements of approximately +35% and +10% over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation amongst slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models, which were specifically designed for the task.
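
To make the idea of slots competing for image features concrete, here is a minimal NumPy sketch of a generic slot-attention update. It is not the GLASS implementation: the function name, the random projection matrices, and the toy dimensions are all hypothetical, and the sketch leaves out the layer normalisation, GRU, and MLP that full slot-attention modules typically include. It shows only the core step, in which attention is normalised over slots so that each input feature is shared out among them.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, inputs, w_q, w_k, w_v, eps=1e-8):
    """One simplified slot-attention iteration (hypothetical helper).

    slots:  (K, D) current slot vectors
    inputs: (N, D) image features, one row per spatial location
    Returns the updated slots (K, D) and the attention map (N, K).
    """
    d = w_q.shape[1]
    q = slots @ w_q                    # (K, d) queries come from the slots
    k = inputs @ w_k                   # (N, d) keys come from the features
    v = inputs @ w_v                   # (N, d) values come from the features
    logits = k @ q.T / np.sqrt(d)      # (N, K)
    # Softmax over the slot axis: slots compete for each input feature.
    attn = softmax(logits, axis=1)
    # Normalise over inputs so each slot takes a weighted mean of the values.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    updates = weights.T @ v            # (K, d)
    return updates, attn


# Toy data: 16 feature vectors, 4 slots, 8 dimensions (all assumed values).
rng = np.random.default_rng(0)
N, K, D = 16, 4, 8
inputs = rng.normal(size=(N, D))
slots = rng.normal(size=(K, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

for _ in range(3):                     # a few refinement iterations
    slots, attn = slot_attention_step(slots, inputs, w_q, w_k, w_v)

# Hard assignment of each feature to its highest-attention slot.
masks = attn.argmax(axis=1)
```

In this sketch the attention map doubles as a soft segmentation: row i of `attn` says how strongly feature i belongs to each slot. When such maps split one object across several slots, the result is the part-level binding that the abstract describes.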
Problem

Research questions and friction points this paper is trying to address.

Decompose images into meaningful object representations
Improve object-centric learning for complex real-world scenes
Enhance slot embeddings for diverse downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guided Latent Slot Diffusion for object-centric learning
Semantic and instance guidance for better embeddings
Compositional generation of complex realistic scenes