ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing benchmarks for visual spatial intelligence have systematic flaws when applied to modern vision-language models (VLMs): ground truth derived from point-cloud annotations can be wrong or ambiguous, and questions that assume full-scene access can be unanswerable from the sparsely sampled frames a model actually sees. This work introduces ReVSI, a benchmark and protocol that ensures each question-answer (QA) pair is answerable and correct under the model's actual inputs. The authors re-annotate objects and geometry across 381 scenes from 5 datasets and regenerate all QA pairs with bias mitigation and human verification using professional 3D annotation tools. The benchmark provides variants at multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata for controlled diagnostic analysis. Evaluations of general and domain-specific VLMs reveal systematic failure modes that prior benchmarks obscure.

📝 Abstract
Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
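
The abstract does not say how frames are sampled or how the visibility metadata is stored. As an illustration only, the sketch below assumes uniform sampling and a hypothetical per-object set of visible frame indices, and shows how a frame budget could determine whether a QA pair is answerable. The names `sample_frame_indices`, `QAPair`, `visible_frames`, and `is_answerable` are invented for this example and are not taken from ReVSI.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


def sample_frame_indices(num_frames: int, budget: Optional[int]) -> List[int]:
    """Pick `budget` frame indices spread evenly over the video.

    A budget of None (or one at least as large as the video) means "all".
    Uniform sampling is an assumption, not the documented ReVSI protocol.
    """
    if budget is None or budget >= num_frames:
        return list(range(num_frames))
    # Take the centre of each of `budget` equal-width segments.
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


@dataclass
class QAPair:
    question: str
    answer: str
    # Hypothetical metadata: for each object the question depends on,
    # the set of frame indices in which that object is visible.
    visible_frames: Dict[str, Set[int]]


def is_answerable(qa: QAPair, sampled: List[int]) -> bool:
    """True if every referenced object appears in at least one sampled frame."""
    sampled_set = set(sampled)
    return all(frames & sampled_set for frames in qa.visible_frames.values())


if __name__ == "__main__":
    qa = QAPair(
        question="Is the chair closer to the door than the table?",
        answer="yes",
        visible_frames={"chair": {10, 11, 12}, "table": {50}},
    )
    for budget in (4, None):
        frames = sample_frame_indices(100, budget)
        print(budget, is_answerable(qa, frames))
```

With these assumptions, a question whose objects appear only in unsampled frames is filtered out at small budgets but retained in the "all" variant, which is the kind of controlled comparison the frame-budget variants are meant to support.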
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
vision-language models
3D reasoning
evaluation benchmark
video-based QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial intelligence evaluation
vision-language models
3D reasoning
re-annotation
frame-budget-aware benchmark