Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a persistent weakness of vision-language models (VLMs) in attribute binding and spatial relations, most visibly their inability to reliably distinguish compositional structures such as "a red cube and a blue sphere" from "a blue cube and a red sphere." To this end, we propose Auto-Comp, a fully automated synthetic data-generation pipeline with controllable variables that produces paired minimal and contextually enriched image-text pairs for fine-grained, scalable evaluation of compositional reasoning. Our analysis reveals a paradoxical effect of visio-linguistic context, which aids spatial reasoning while hindering attribute binding, and demonstrates that VLMs are highly sensitive to low-entropy distractors such as repeated objects or colors. Extensive experiments across 20 prominent models confirm the ubiquity of this deficiency. We publicly release the Auto-Comp pipeline and the complete benchmark suite to advance interpretable evaluation of vision-language models.
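To make the scoring setup concrete, the sketch below shows how a contrastive VLM such as CLIP can be probed on a single attribute-swap pair using the Hugging Face transformers API; the checkpoint name, image file, and captions are illustrative assumptions, not the paper's released Auto-Comp code.

```python
# Minimal attribute-binding probe for a contrastive VLM (sketch, not the
# Auto-Comp release). Checkpoint, image path, and captions are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_cube_blue_sphere.png")  # hypothetical benchmark image
captions = [
    "a red cube and a blue sphere",  # correct binding
    "a blue cube and a red sphere",  # attribute-swapped distractor
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity, shape (1, 2)

# A trial counts as correct when the true caption outscores the swapped one.
is_correct = logits.argmax(dim=-1).item() == 0
print(f"scores={logits.squeeze().tolist()}, correct_binding={is_correct}")
```

Averaging this binary outcome over many generated pairs gives a per-model binding accuracy of the kind reported in the evaluation.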

📝 Abstract
Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal captions (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test that disentangles core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both the CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. Finally, we uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).
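As a rough illustration of the benchmark construction described above, the sketch below assembles a Minimal caption, a stand-in Contextual caption, and a low-entropy confusion set for one object pair; the templates and distractor categories are assumptions for illustration, not the paper's exact prompts or LLM outputs.

```python
# Illustrative caption construction for Minimal/Contextual pairs and a
# low-entropy "confusion" set; templates and distractor types are assumed.
import random

OBJECTS = ["cube", "sphere", "monitor", "bicycle"]
COLORS = ["red", "blue", "green", "yellow"]

def minimal_caption(c1, o1, c2, o2):
    # Bare scene on a plain background, as in the Minimal setting.
    return f"a {c1} {o1} and a {c2} {o2} on a white background"

def contextual_caption(c1, o1, c2, o2):
    # Fixed template standing in for an LLM-generated Contextual caption.
    return (f"In a brightly lit photography studio, a {c1} {o1} "
            f"is placed next to a {c2} {o2}")

def confusion_set(c1, o1, c2, o2):
    # Correct caption plus low-entropy distractors: an attribute swap,
    # a repeated object, and a repeated color.
    return {
        "correct":       minimal_caption(c1, o1, c2, o2),
        "swap":          minimal_caption(c2, o1, c1, o2),
        "repeat_object": minimal_caption(c1, o1, c2, o1),
        "repeat_color":  minimal_caption(c1, o1, c1, o2),
    }

c1, c2 = random.sample(COLORS, 2)
o1, o2 = random.sample(OBJECTS, 2)
print(minimal_caption(c1, o1, c2, o2))
print(contextual_caption(c1, o1, c2, o2))
print(confusion_set(c1, o1, c2, o2))
```

In the full pipeline these captions would be paired with synthesized images, so each caption set yields a controlled A/B comparison between the Minimal and Contextual settings.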
Problem

Research questions and friction points this paper is trying to address.

compositional reasoning
vision-language models
attribute binding
spatial relations
visual-linguistic context
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional reasoning
vision-language models
automated benchmarking
attribute binding
confusion benchmark