🤖 AI Summary
Current vision-language models (VLMs) excel at complex multimodal tasks yet show severe deficiencies in atomic-level visual perception, particularly basic 2D Euclidean geometry (e.g., judging parallelism or collinearity). Method: the paper introduces the concept of “atomic visual skills” and proposes the first fine-grained, interpretable framework for decomposing visual capabilities, and it releases AVSD, a dedicated benchmark comprising both human-annotated and procedurally generated geometric perception tasks. Contribution/Results: evaluating leading VLMs (LLaVA, Qwen-VL, Fuyu) under zero-shot and fine-tuning protocols on AVSD, the authors find that accuracy on elementary geometric judgments consistently falls below 65%, substantially underperforming humans. This work shifts VLM evaluation from composite tasks back to foundational perceptual primitives, establishing a new paradigm for interpretable measurement and targeted enhancement of visual competence.
📝 Abstract
Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, which are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
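To make the notion of an atomic visual skill concrete, the sketch below shows what a single procedurally generated parallelism item could look like: two line segments plus a yes/no ground-truth label that a VLM would answer from the rendered image. This is a minimal illustration, not the paper's actual generation code; the function name `make_parallelism_item`, the item fields, and the sampling choices are assumptions made for exposition.

```python
import math
import random


def make_parallelism_item(rng: random.Random) -> dict:
    """Generate one toy parallelism item: two 2D segments and a yes/no label.

    Illustrative only -- the real AVSD generation pipeline is not shown here.
    """
    # Base segment with a random, non-degenerate direction vector.
    x1, y1 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    dx = dy = 0.0
    while math.hypot(dx, dy) < 0.2:  # avoid near-zero-length segments
        dx, dy = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    seg_a = ((x1, y1), (x1 + dx, y1 + dy))

    parallel = rng.random() < 0.5
    x2, y2 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    if parallel:
        # Same direction vector, translated to a different location.
        seg_b = ((x2, y2), (x2 + dx, y2 + dy))
    else:
        # Rotate the direction by a clearly nonzero angle so the label is unambiguous.
        angle = rng.uniform(math.radians(20), math.radians(160))
        rdx = dx * math.cos(angle) - dy * math.sin(angle)
        rdy = dx * math.sin(angle) + dy * math.cos(angle)
        seg_b = ((x2, y2), (x2 + rdx, y2 + rdy))

    return {
        "segments": [seg_a, seg_b],  # would be rendered to an image for the VLM
        "question": "Are the two line segments in the image parallel? Answer yes or no.",
        "answer": "yes" if parallel else "no",
    }


if __name__ == "__main__":
    item = make_parallelism_item(random.Random(0))
    print(item["question"], "->", item["answer"])
```

Because such items are generated rather than scraped, the answer key is exact and the difficulty (e.g., minimum rotation angle, segment length) can be controlled, which is what makes procedurally generated perception tasks attractive for benchmarking.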