CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at complex visual instruction following and multi-image understanding, yet mainstream benchmarks predominantly evaluate spatially unrelated images or discrete images taken from varied viewpoints, neglecting the geometric coherence of image sequences captured from a static viewpoint under continuous orientation changes — a property the authors term "Continuous Space Perception." Method: The paper introduces CoSpace, a multi-image benchmark designed to assess this capability, comprising 2,918 real-world, continuously captured images and 1,626 question-answer pairs spanning seven types of spatial reasoning tasks. Contribution/Results: Evaluating 19 proprietary and open-source VLMs reveals pervasive deficiencies in continuous space perception, including in proprietary models; notably, the main gap between open-source and proprietary models lies not in accuracy but in the consistency of responses. CoSpace establishes a reproducible, task-grounded standard for evaluating and improving VLMs' spatial robustness.

📝 Abstract
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs on tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially unrelated images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientations produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space Perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results reveal that pitfalls in continuous space perception exist for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.
Problem

Research questions and friction points this paper is trying to address.

Assessing Continuous Space Perception in Vision-Language Models
Evaluating VLMs on spatially continuous image comprehension
Identifying pitfalls in VLMs' spatial reasoning consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoSpace benchmark for continuous space perception
Multi-image understanding with static viewpoint images
Evaluation of 19 VLMs for spatial consistency
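The paper distinguishes answer accuracy from the consistency of a model's responses across spatially continuous views, but this page does not spell out how such consistency could be scored. As an illustrative sketch only (not the authors' metric), a simple majority-agreement measure over the answers a model gives to the same question under different views might look like:

```python
from collections import Counter


def response_consistency(responses: list[str]) -> float:
    """Fraction of responses agreeing with the most common answer.

    `responses` holds the answers a model gives to one question
    posed over several contiguous views of the same scene; 1.0
    means the model never changes its answer, lower values mean
    the answer flips as the viewpoint orientation shifts.
    This is a hypothetical helper, not CoSpace's scoring code.
    """
    if not responses:
        return 0.0
    # Count of the single most frequent answer.
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / len(responses)


# A model that flips its answer on one of four views scores 0.75:
print(response_consistency(["left", "left", "right", "left"]))
```

Under such a measure, two models with equal per-view accuracy can still differ sharply in consistency, which mirrors the open-source vs. proprietary gap the paper reports.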
Yiqi Zhu
Undergraduate Student, Tsinghua University
Artificial Intelligence
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Can Zhang
School of Computer and Communication Engineering, University of Science and Technology Beijing
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China, Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China, Shanghai Artificial Intelligence Laboratory, Shanghai, China, Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China