🤖 AI Summary
Current vision-language models (VLMs) excel at complex multimodal tasks yet show severe deficiencies in atomic-level visual perception, particularly basic 2D Euclidean geometry (e.g., judging parallelism or collinearity). Method: the paper introduces the concept of “atomic visual skills” and proposes the first fine-grained, interpretable framework for decomposing visual capabilities, and it releases AVSD, a dedicated benchmark comprising both human-annotated and procedurally generated geometric perception tasks. Contribution/Results: evaluating leading VLMs (LLaVA, Qwen-VL, Fuyu) under zero-shot and fine-tuning protocols on AVSD, the authors find that accuracy on elementary geometric judgments consistently falls below 65%, substantially underperforming humans. This work shifts VLM evaluation from composite tasks back to foundational perceptual primitives, establishing a new paradigm for interpretable measurement and targeted enhancement of visual competence.
📝 Abstract
Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, which are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
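To make the notion of an atomic visual skill concrete, the sketch below shows what a single procedurally generated parallelism item could look like: two line segments plus a yes/no ground-truth label that a VLM would answer from the rendered image. This is a minimal illustration, not the paper's actual generation code; the function name `make_parallelism_item`, the item fields, and the sampling choices are assumptions made for exposition.

```python
import math
import random


def make_parallelism_item(rng: random.Random) -> dict:
    """Generate one toy parallelism item: two 2D segments and a yes/no label.

    Illustrative only -- the real AVSD generation pipeline is not shown here.
    """
    # Base segment with a random, non-degenerate direction vector.
    x1, y1 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    dx = dy = 0.0
    while math.hypot(dx, dy) < 0.2:  # avoid near-zero-length segments
        dx, dy = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    seg_a = ((x1, y1), (x1 + dx, y1 + dy))

    parallel = rng.random() < 0.5
    x2, y2 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    if parallel:
        # Same direction vector, translated to a different location.
        seg_b = ((x2, y2), (x2 + dx, y2 + dy))
    else:
        # Rotate the direction by a clearly nonzero angle so the label is unambiguous.
        angle = rng.uniform(math.radians(20), math.radians(160))
        rdx = dx * math.cos(angle) - dy * math.sin(angle)
        rdy = dx * math.sin(angle) + dy * math.cos(angle)
        seg_b = ((x2, y2), (x2 + rdx, y2 + rdy))

    return {
        "segments": [seg_a, seg_b],  # would be rendered to an image for the VLM
        "question": "Are the two line segments in the image parallel? Answer yes or no.",
        "answer": "yes" if parallel else "no",
    }


if __name__ == "__main__":
    item = make_parallelism_item(random.Random(0))
    print(item["question"], "->", item["answer"])
```

Because such items are generated rather than scraped, the answer key is exact and the difficulty (e.g., minimum rotation angle, segment length) can be controlled, which is what makes procedurally generated perception tasks attractive for benchmarking.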