🤖 AI Summary
Existing object-centric representation learning models can discover scene objects without supervision, but they lack language controllability, preventing targeted extraction of specific object instances via natural-language instructions. To address this, we propose the first language-driven object-centric representation framework. Our method introduces slot-language cross-modulation and CLIP-feature-guided differentiable attention routing to achieve end-to-end self-supervised semantic binding and cross-modal alignment. Crucially, it operates without pixel-level mask supervision, enabling text-directed object localization and instance-level representation generation. Evaluated on complex real-world scenes, our approach significantly improves language-guided object extraction accuracy, enhances instance fidelity in text-to-image generation, and achieves strong performance on visual question answering. The framework bridges the gap between object-centric learning and grounded language understanding, enabling precise, interpretable, and controllable scene decomposition through natural language.
📝 Abstract
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success at object discovery across diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. We then apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and achieves strong performance on visual question answering.
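The core idea of conditioning slots on language descriptions can be illustrated with a minimal, toy sketch of language-conditioned slot attention. This is not the paper's implementation: the function name, the plain dot-product attention, and the choice of initializing slots directly from text embeddings are illustrative assumptions about the general mechanism (slots compete for visual features, with language steering which object each slot binds to):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_conditioned_slots(features, text_emb, n_iters=3):
    """Toy language-conditioned slot attention (illustrative, not CTRL-O itself).

    features: (N, D) array of visual feature vectors from an encoder.
    text_emb: (K, D) array, one language embedding per requested object;
              slots are initialized from these instead of random noise,
              so each slot is steered toward the described object.
    Returns a (K, D) array of slot representations.
    """
    K, D = text_emb.shape
    slots = text_emb.copy()
    for _ in range(n_iters):
        logits = slots @ features.T / np.sqrt(D)       # (K, N) slot-feature similarity
        attn = softmax(logits, axis=0)                 # slots compete over each feature
        attn = attn / attn.sum(axis=1, keepdims=True)  # normalize weights per slot
        slots = attn @ features                        # update slots as weighted means
    return slots

# Tiny demo with synthetic features and two "language" queries.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))   # 6 visual tokens, dimension 4
text_emb = rng.normal(size=(2, 4))   # 2 language-specified objects
slots = language_conditioned_slots(features, text_emb)
print(slots.shape)  # (2, 4)
```

In the actual model, the conditioning is learned end-to-end together with the visual encoder and decoder; this sketch only shows how a text embedding can replace random slot initialization so that each slot binds to the object its description refers to.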