🤖 AI Summary
This work addresses the lack of a systematic definition and evaluation of Basic Spatial Abilities (BSAs) in vision-language models (VLMs). Drawing on psychometrics, we propose the first five-dimensional framework for VLM spatial competence, encompassing Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization, and design nine standardized psychometric tasks for comprehensive assessment. Evaluating 13 state-of-the-art VLMs, we observe a human-like hierarchical ability distribution across dimensions (strongest in 2D orientation, weakest in 3D rotation), yet low inter-dimensional correlation (Pearson's *r* < 0.4) and counterintuitive phenomena such as smaller models outperforming larger ones. Core bottlenecks are weak geometric representation and the absence of dynamic spatial simulation. Average model performance is 24.95 (vs. 68.38 for humans); Qwen2-VL-7B achieves the highest score (30.82), while InternVL2 lags (19.6). Chain-of-thought reasoning and 5-shot training yield only marginal gains (+0.100 and +0.259 accuracy, respectively). Our diagnostic, extensible spatial intelligence benchmark establishes a methodological foundation for modeling spatial cognition in VLMs.
📝 Abstract
The Theory of Multiple Intelligences underscores the hierarchical nature of cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals significant gaps versus humans (average score 24.95 vs. 68.38) and three key findings: 1) VLMs mirror human ability hierarchies (strongest in 2D orientation, weakest in 3D rotation), with largely independent BSAs (Pearson's r < 0.4); 2) smaller models such as Qwen2-VL-7B can surpass larger counterparts, with Qwen2-VL-7B leading (30.82) and InternVL2 lagging (19.6); 3) interventions such as chain-of-thought prompting (+0.100 accuracy) and 5-shot training (+0.259) yield only limited gains, pointing to architectural constraints. Identified barriers include weak geometric encoding and the absence of dynamic spatial simulation. By linking psychometric BSAs to VLM capabilities, we provide a diagnostic toolkit for spatial intelligence evaluation, a methodological foundation for embodied AI development, and a cognitive-science-informed roadmap toward human-like spatial intelligence.
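The independence claim above (pairwise Pearson's r < 0.4 across the five BSA dimensions) amounts to computing a correlation matrix over per-model dimension scores. A minimal NumPy sketch of that check follows; the score matrix is an illustrative placeholder, not the paper's actual data:

```python
import numpy as np

# Hypothetical per-model scores (rows = models, columns = the five BSA
# dimensions). These numbers are illustrative only.
scores = np.array([
    [55.0, 30.0, 62.0, 12.0, 20.0],
    [48.0, 35.0, 58.0, 15.0, 25.0],
    [60.0, 22.0, 70.0, 10.0, 18.0],
    [42.0, 40.0, 50.0, 20.0, 30.0],
    [50.0, 28.0, 65.0,  8.0, 22.0],
])
dims = ["Perception", "Relation", "Orientation", "Rotation", "Visualization"]

# Pearson correlation between dimensions (columns). Low off-diagonal |r|
# would suggest the abilities vary independently across models.
r = np.corrcoef(scores, rowvar=False)
for i in range(len(dims)):
    for j in range(i + 1, len(dims)):
        print(f"{dims[i]} vs {dims[j]}: r = {r[i, j]:+.2f}")
```

With real benchmark scores in place of the placeholder matrix, the printed off-diagonal values are what the r < 0.4 threshold would be applied to.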