LRR-Bench: Left, Right or Rotate? Vision-Language Models Still Struggle With Spatial Understanding Tasks

📅 2025-07-27
🤖 AI Summary
This work addresses a critical deficiency of vision-language models (VLMs): spatial reasoning. The authors systematically evaluate performance on absolute spatial tasks (e.g., left/right object localization in an image) and 3D spatial tasks (e.g., motion direction estimation and rotation recognition). To this end, they introduce a fully synthetic, low-cost, contamination-free spatial perception benchmark that explicitly decomposes spatial understanding into two orthogonal dimensions: absolute position and 3D motion. The benchmark employs controllable, interpretable synthetic stimuli with precise ground-truth spatial annotations. Extensive experiments on state-of-the-art VLMs reveal severe limitations: top-performing models substantially underperform humans on most spatial reasoning tasks, with accuracy dropping to near zero on several core tasks, indicating fundamental architectural or training-induced deficits. The work thus establishes a standardized evaluation framework and a rigorous diagnostic tool for assessing and advancing spatial cognition in VLMs.
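To make the benchmark's design concrete, below is a minimal sketch of how a controllable absolute-position stimulus with exact ground truth could be synthesized. It assumes Pillow for rendering, and every name in it (`make_left_right_sample`, `size`, `radius`) is illustrative: the paper's actual generation code lives in the linked repository.

```python
# Hypothetical sketch of an absolute-position stimulus generator.
# The real LRR-Bench generation code is in the linked repository;
# function and parameter names here are illustrative only.
import random
from PIL import Image, ImageDraw

def make_left_right_sample(size=256, radius=20):
    """Render a single circle on a blank canvas and return the image
    together with its ground-truth horizontal position label."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)

    # Keep the object clear of the vertical midline so the label is unambiguous.
    side = random.choice(["left", "right"])
    if side == "left":
        cx = random.randint(radius, size // 2 - 2 * radius)
    else:
        cx = random.randint(size // 2 + 2 * radius, size - radius)
    cy = random.randint(radius, size - radius)

    draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius], fill="red")
    return img, side  # precise ground truth comes for free with synthesis

img, label = make_left_right_sample()
```

Because the label is produced by the generator rather than by human annotation, samples are cheap, unambiguous, and cannot have leaked into any model's training data, which is the core appeal of the fully synthetic design described above.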

📝 Abstract
Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Concretely, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. On the remaining tasks, VLM performance is distinctly lower than human performance; in fact, the best-performing VLMs achieve near-zero scores on multiple tasks. The dataset and code are available at https://github.com/kong13661/LRR-Bench.
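As a companion illustration for the 3D side of the taxonomy, here is a hedged sketch of a rotation-recognition sample: two frames of the same asymmetric shape, rotated clockwise or counter-clockwise between them. It again assumes Pillow, and the names are hypothetical rather than the repository's API.

```python
# Hypothetical sketch of a rotation-recognition stimulus: two frames of the
# same asymmetric shape, rotated between frames. Labels and names are
# illustrative, not the paper's actual code.
import random
from PIL import Image, ImageDraw

def draw_arrow(size=256, angle_deg=0.0):
    """Render an arrow-like polygon, then rotate the whole frame."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    c = size // 2
    draw.polygon([(c, c - 60), (c + 30, c + 40), (c - 30, c + 40)], fill="blue")
    return img.rotate(angle_deg, fillcolor="white")

def make_rotation_sample(step_deg=30):
    direction = random.choice(["clockwise", "counter-clockwise"])
    start = random.uniform(0, 360)
    # PIL's Image.rotate turns counter-clockwise for positive angles.
    delta = -step_deg if direction == "clockwise" else step_deg
    frame_a = draw_arrow(angle_deg=start)
    frame_b = draw_arrow(angle_deg=start + delta)
    return (frame_a, frame_b), direction

frames, label = make_rotation_sample()
```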
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial perception in left-right-rotate tasks
Assessing VLMs' absolute and 3D spatial understanding gaps
Benchmarking VLM performance on contamination-free synthetic spatial datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset for low-cost spatial evaluation
Benchmark for absolute and 3D spatial understanding
Evaluation pipeline for Vision-Language Models (see the sketch below)
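The evaluation-pipeline bullet above can be pictured as a simple multiple-choice scoring loop. In the sketch below, `query_vlm(image, prompt)` is a caller-supplied placeholder standing in for whatever model API is under test; this is an illustration of the general recipe, not the paper's released code.

```python
# Hypothetical sketch of the evaluation loop: pose a fixed multiple-choice
# question per sample and score exact-match accuracy against the synthetic
# ground truth. `query_vlm` is a placeholder for the model under test.
def evaluate(samples, query_vlm, choices=("left", "right")):
    """samples: iterable of (image, ground_truth_label) pairs."""
    prompt = ("Where is the red object located in the image? "
              f"Answer with exactly one word: {' or '.join(choices)}.")
    correct = 0
    total = 0
    for image, label in samples:
        answer = query_vlm(image, prompt).strip().lower()
        correct += int(answer == label)
        total += 1
    return correct / max(total, 1)
```

Constraining the answer to a closed set of one-word choices keeps scoring a trivial exact match, so measured gaps reflect the model's spatial perception rather than answer-parsing noise.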