Object-centric Binding in Contrastive Language-Image Pretraining

📅 2025-02-19

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

Existing vision-language models (e.g., CLIP) struggle to accurately comprehend complex compositional scenes involving multiple objects and their spatial relationships. To address this, we propose an object-centric binding mechanism that aligns text-generated scene graphs with slot-based visual representations, enabling structured cross-modal similarity modeling. Our approach is the first to explicitly integrate scene graph inductive biases into a slot attention framework without relying on hard negative sampling, while additionally introducing text-conditioned visual relational constraints. This design significantly enhances multi-object compositional reasoning and fine-grained image–text matching. We achieve state-of-the-art performance on benchmarks including RefCOCO+ and CLEVR-Single, and improve training sample efficiency by 37%.

Technology Category

Application Category

📝 Abstract

Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.

Problem

Research questions and friction points this paper is trying to address.

Improve compositional scene understanding in VLMs

Enhance multi-object spatial relationship comprehension

Facilitate accurate image-text matching in complex scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates inductive biases in CLIP-like models

Introduces binding module for scene graphs

Uses text-conditioned visual constraints

🔎 Similar Papers

RankCLIP: Ranking-Consistent Language-Image Pretraining