🤖 AI Summary
This work addresses the persistent challenge in vision-language models (VLMs) of conflating object attributes with spatial relationships, particularly their inability to reliably distinguish between compositional structures such as "a red cube and a blue sphere" versus "a blue cube and a red sphere". To this end, we propose Auto-Comp, the first fully automated synthetic data generation pipeline that supports controllable variables and produces minimally designed yet contextually enriched image-text pairs for fine-grained, scalable evaluation of compositional reasoning. Our analysis reveals a paradoxical influence of visual-language context on spatial reasoning and attribute binding, and demonstrates that VLMs exhibit high sensitivity to low-entropy distractors. Extensive experiments across 20 prominent models confirm the ubiquity of this deficiency. We publicly release the Auto-Comp pipeline and the complete benchmark suite to advance interpretable evaluation of vision-language models.
📝 Abstract
Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal captions (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both the CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. We also uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).
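The evaluation protocol described above boils down to a forced-choice test: for each generated image, does the model score the correct caption above its attribute-swapped distractor? A minimal sketch of that scoring step is given below; the function name, the example similarity values, and the caption pairing are illustrative assumptions, not the actual Auto-Comp API.

```python
def binding_accuracy(correct_scores, swapped_scores):
    """Fraction of images whose correct caption (e.g. "a red cube and a
    blue sphere") receives a higher image-text similarity than the
    attribute-swapped distractor ("a blue cube and a red sphere").
    Scores would come from a CLIP/SigLIP-style model; here they are
    plain floats so the logic is self-contained."""
    assert len(correct_scores) == len(swapped_scores)
    wins = sum(c > s for c, s in zip(correct_scores, swapped_scores))
    return wins / len(correct_scores)


if __name__ == "__main__":
    # Hypothetical cosine-similarity scores for five generated images.
    correct = [0.31, 0.28, 0.25, 0.30, 0.27]
    swapped = [0.29, 0.30, 0.26, 0.24, 0.26]
    print(binding_accuracy(correct, swapped))  # 0.6; chance level is 0.5
```

A model with no attribute binding ability would sit near 0.5 on this metric, which is why paired Minimal/Contextual captions give a clean A/B comparison.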