🤖 AI Summary
Dual-encoder vision-language models (VLMs) such as CLIP exhibit poor compositional generalization in image–text retrieval, particularly in modeling attribute–object bindings, and behave much like bag-of-words models. To address this, we propose a **lightweight, fine-tuning-free, inference-time structured alignment method**: images are partitioned into local regions, texts are parsed into object, attribute, and relation fragments, fine-grained visual and linguistic units are matched across modalities, and the final similarity score is computed via weighted aggregation. This is the first work to systematically demonstrate that **structured decomposition at inference time alone, without architectural or training modifications, significantly enhances compositional generalization in VLMs**. Our method is fully compatible with standard dual-encoder models (e.g., CLIP) and consistently improves retrieval performance on both controlled and natural benchmarks, achieving up to a 12.7% absolute gain in attribute–object composition accuracy. These results underscore the substantial untapped potential of inference-time structural optimization for VLMs.
📝 Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval. However, these models struggle with compositionality, showing bag-of-words-like behavior that limits their retrieval performance. Many training approaches have been proposed to improve the vision-language compositionality of these models; in comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference time, where, given an image and a caption: i) we divide the image into smaller crops, ii) we extract text segments capturing objects, attributes, and relations, iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. We evaluate our approach with several popular dual-encoder VLMs on controlled and natural datasets for vision-language compositionality. We find that our approach consistently improves the performance of the evaluated VLMs without any training, which shows the potential of inference-time techniques. The gains are especially large for attribute-object binding, as shown on the controlled dataset. Through an extensive analysis: i) we show that processing image crops is essential for the observed performance gains, and ii) we identify specific areas in which inference-time approaches can be further improved.
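The four inference-time steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a hypothetical stand-in for a real dual-encoder (e.g., CLIP's image and text encoders), the crop and segment inputs are represented as plain strings, and uniform aggregation weights are an assumption in place of the paper's weighting scheme.

```python
import hashlib
import numpy as np

def embed(item, dim=8):
    # Hypothetical stand-in for a dual-encoder VLM: maps any input to a
    # deterministic unit-norm embedding. A real system would call the
    # model's image encoder on crops and text encoder on segments.
    seed = int(hashlib.md5(item.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def structured_similarity(crops, segments, embed_fn=embed):
    """Match each text segment to its best-aligned image crop, then
    aggregate the matched similarities into one image-text score."""
    crop_embs = np.stack([embed_fn(c) for c in crops])     # (n_crops, d)
    seg_embs = np.stack([embed_fn(s) for s in segments])   # (n_segs, d)
    sims = seg_embs @ crop_embs.T          # cosine similarities (unit vectors)
    best = sims.max(axis=1)                # best crop per segment: the matches
    return float(best.mean())              # aggregation with uniform weights
```

Taking the maximum over crops per segment lets each fine-grained text unit (object, attribute, or relation) attach to the local region that supports it, rather than forcing one global embedding to account for the whole caption.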