🤖 AI Summary
This work addresses the challenges of slot entanglement and weak slot-image alignment in object-centric learning with diffusion models. To this end, the authors propose the CODA framework, which introduces a register-slot mechanism to mitigate interference among object slots and incorporates a contrastive alignment loss to explicitly strengthen the correspondence between slots and image regions. Together, these components form a tractable surrogate objective for maximizing mutual information between slots and the input. Extensive experiments demonstrate that CODA consistently outperforms existing baselines on MOVi-C/E, Pascal VOC, and COCO, achieving a 6.1% improvement in FG-ARI on COCO while maintaining computational efficiency and scalability.
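The paper's exact register-slot formulation is not reproduced here, but the idea of registers that "absorb residual attention" can be illustrated with a simplified Slot Attention pass: extra learnable slots join the softmax competition over input features and are then discarded from the output. The function name, shapes, and the stripped-down update (no GRU/MLP refinement) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def slot_attention_with_registers(inputs, object_slots, register_slots, n_iter=3):
    """Illustrative Slot Attention pass where extra 'register' slots
    compete for attention but are dropped from the output.

    inputs:         (N, D) encoded image features
    object_slots:   (K, D) object slot initializations
    register_slots: (R, D) learnable registers (hypothetical shapes/names;
                    the paper's actual formulation may differ)
    """
    slots = np.concatenate([object_slots, register_slots], axis=0)  # (K+R, D)
    for _ in range(n_iter):
        # Attention is normalized over slots, so registers can absorb
        # residual attention that would otherwise leak into object slots.
        logits = slots @ inputs.T / np.sqrt(inputs.shape[1])        # (K+R, N)
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn = attn / attn.sum(axis=0, keepdims=True)               # softmax over slots
        # Weighted mean of inputs per slot (update rule simplified here).
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ inputs
    return slots[: len(object_slots)]  # registers are discarded
```

Because the registers are removed before decoding, they add only R extra rows to the attention computation, consistent with the claim that they introduce negligible overhead.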
📝 Abstract
Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results position CODA as an effective framework for robust OCL in complex, real-world scenes.
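The abstract does not specify the form of the contrastive alignment loss; a common instantiation of such slot-image objectives is a symmetric InfoNCE loss, whose minimization is a standard lower-bound surrogate for mutual information. The sketch below assumes hypothetical pooled per-image slot embeddings paired with global image embeddings; the names, pooling, and temperature are illustrative assumptions rather than CODA's actual design.

```python
import numpy as np

def info_nce_alignment(slot_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE between per-image slot summaries and image embeddings.

    slot_emb: (B, D) pooled slot representation per image (hypothetical pooling)
    img_emb:  (B, D) global image embedding
    Matching pairs (slot_emb[i], img_emb[i]) are positives; all other
    pairings in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    s = slot_emb / np.linalg.norm(slot_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = s @ v.T / temperature            # (B, B) similarity matrix
    idx = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the slots->images and images->slots directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling each slot summary toward its own image while pushing it away from other images in the batch is one concrete way to "explicitly encourage slot-image correspondence" as described above.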