🤖 AI Summary
Dual-encoder vision-language models (VLMs) such as CLIP exhibit poor compositional generalization in image–text retrieval, particularly in modeling attribute–object bindings, and behave much like bag-of-words models. To address this, we propose a **lightweight, fine-tuning-free, inference-time structured alignment method**: images are partitioned into local regions, texts are parsed into object, attribute, and relation fragments, fine-grained visual and linguistic units are matched across modalities, and the final similarity score is computed via weighted aggregation. This is the first work to systematically demonstrate that **structured decomposition at inference time alone, without architectural or training modifications, significantly enhances compositional generalization in VLMs**. Our method is fully compatible with standard dual-encoder models (e.g., CLIP) and consistently improves retrieval performance on both controlled and natural benchmarks, achieving up to a 12.7% absolute gain in attribute–object composition accuracy. These results underscore the substantial untapped potential of inference-time structural optimization for VLMs.
📝 Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval. However, these models struggle with compositionality, showing bag-of-words-like behavior that limits their retrieval performance. Many training approaches have been proposed to improve the vision-language compositionality of these models; in comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference time, where, given an image and a caption: i) we divide the image into smaller crops, ii) we extract text segments capturing objects, attributes, and relations, iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. We evaluate our approach with several popular dual-encoder VLMs on controlled and natural datasets for vision-language compositionality. We find that our approach consistently improves the performance of the evaluated VLMs without any training, which shows the potential of inference-time techniques. The gains are especially large for attribute-object binding, as shown on the controlled dataset. Through an extensive analysis: i) we show that processing image crops is essential for the observed performance gains, and ii) we identify specific areas in which inference-time approaches can be further improved.
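The four inference-time steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a hypothetical stand-in for a real dual-encoder (e.g., CLIP's image and text encoders), the crop and segment inputs are represented as plain strings, and uniform aggregation weights are an assumption in place of the paper's weighting scheme.

```python
import hashlib
import numpy as np

def embed(item, dim=8):
    # Hypothetical stand-in for a dual-encoder VLM: maps any input to a
    # deterministic unit-norm embedding. A real system would call the
    # model's image encoder on crops and text encoder on segments.
    seed = int(hashlib.md5(item.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def structured_similarity(crops, segments, embed_fn=embed):
    """Match each text segment to its best-aligned image crop, then
    aggregate the matched similarities into one image-text score."""
    crop_embs = np.stack([embed_fn(c) for c in crops])     # (n_crops, d)
    seg_embs = np.stack([embed_fn(s) for s in segments])   # (n_segs, d)
    sims = seg_embs @ crop_embs.T          # cosine similarities (unit vectors)
    best = sims.max(axis=1)                # best crop per segment: the matches
    return float(best.mean())              # aggregation with uniform weights
```

Taking the maximum over crops per segment lets each fine-grained text unit (object, attribute, or relation) attach to the local region that supports it, rather than forcing one global embedding to account for the whole caption.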