Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited compositional generalization of dual-encoder vision-language models, a weakness often attributed to insufficient representational capacity. The authors instead identify global embedding matching at inference as a key bottleneck and propose an approach that leaves the pretrained encoders frozen: a lightweight Transformer learns fine-grained, local alignments directly from frozen image patch and text token embeddings. On multiple controlled out-of-distribution compositional benchmarks, this approach yields substantial improvements over full fine-tuning and other end-to-end methods, while matching full fine-tuning on in-domain retrieval.

📝 Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
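
The contrast the abstract draws between global cosine similarity and localized alignment can be illustrated with a small NumPy sketch. This is not the paper's method: the function names are hypothetical, mean pooling stands in for whatever pooling a given encoder uses, and the max-over-patches score is a generic late-interaction heuristic rather than the learned transformer described above. Inputs are assumed to be frozen patch and token embeddings of shape (count, dim).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_cosine_score(patches, tokens):
    """Global matching: pool each modality to one vector, then take cosine similarity.

    Assumption: mean pooling is used here purely for illustration.
    """
    img = l2_normalize(patches.mean(axis=0))
    txt = l2_normalize(tokens.mean(axis=0))
    return float(img @ txt)


def local_alignment_score(patches, tokens):
    """Localized matching (illustrative): align every token to its best patch.

    Builds the full token-by-patch cosine matrix, takes the best-matching
    patch for each token, and averages those per-token maxima.
    """
    p = l2_normalize(patches)   # (num_patches, dim)
    t = l2_normalize(tokens)    # (num_tokens, dim)
    sim = t @ p.T               # (num_tokens, num_patches)
    return float(sim.max(axis=1).mean())


if __name__ == "__main__":
    # Toy case: two patches that cancel under pooling, one token matching the first.
    patches = np.array([[1.0, 0.0], [-1.0, 0.0]])
    tokens = np.array([[1.0, 0.0]])
    print("global:", global_cosine_score(patches, tokens))
    print("local: ", local_alignment_score(patches, tokens))
```

In the toy case, pooling averages the two patches to a zero vector, so the global score carries no signal, while the local score still finds the patch that matches the token. This only illustrates how a single pooled comparison can discard local evidence; it says nothing about the magnitude of the gains reported in the paper.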
Problem

Research questions and friction points this paper is trying to address.

compositionality
dual-encoder vision-language models
inference protocol
global embedding matching
compositional generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional generalization
dual-encoder VLMs
fine-grained alignment
frozen representations
inference protocol