🤖 AI Summary
While current vision foundation models excel at object localization, they exhibit significant deficiencies in object-centric spatial reasoning—such as relative positioning, grouping structure, and depth relationships—revealing a fundamental trade-off between precise localization and deep spatial cognition.
Method: We introduce the first synthetic-data-driven benchmark for spatial reasoning tailored to foundation models, decoupling localization accuracy, relational inference, and downstream retrieval tasks to systematically evaluate detection models (e.g., GroundingDINO, OWLv2) and multimodal large language models (e.g., GPT-4o, LLaVA).
Contribution/Results: Experiments show that detection models achieve high pixel-level localization fidelity but fail to model topological inter-object relationships; conversely, multimodal LLMs generate fluent natural-language descriptions yet lack geometric consistency. This work exposes critical limitations of existing models in real-world spatial understanding and establishes a reproducible, controllable, and scalable evaluation paradigm for fine-grained spatial intelligence research.
📝 Abstract
Spatial understanding is a critical capability for vision foundation models. While recent advances in large vision models and vision-language models (VLMs) have expanded recognition capabilities, most benchmarks emphasize localization accuracy rather than whether models capture how objects are arranged and related within a scene. This gap is consequential: effective scene understanding requires not only identifying objects, but also reasoning about their relative positions, groupings, and depth. In this paper, we present a systematic benchmark for object-centric spatial reasoning in foundation models. Using a controlled synthetic dataset, we evaluate state-of-the-art vision models (e.g., GroundingDINO, Florence-2, OWLv2) and large VLMs (e.g., InternVL, LLaVA, GPT-4o) across three tasks: spatial localization, spatial reasoning, and downstream retrieval. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning, while VLMs such as SmolVLM and GPT-4o provide coarse layout cues and fluent captions but struggle with fine-grained spatial context. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially aware foundation models.
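To make the localization-vs-reasoning distinction concrete, here is a minimal illustrative sketch (not taken from the paper): given the axis-aligned bounding boxes a detector emits, coarse relational predicates such as "left of", "above", and "overlaps" can be derived geometrically, alongside IoU as the standard localization-fidelity metric. The box format `(x_min, y_min, x_max, y_max)` in pixel coordinates with a top-left origin, and the function names, are assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union: the standard localization-fidelity metric.

    Boxes are (x_min, y_min, x_max, y_max), origin at the top-left.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def spatial_relations(a, b):
    """Coarse relational predicates between two boxes, via center comparison."""
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    rels = []
    if cax < cbx:
        rels.append("left_of")
    elif cax > cbx:
        rels.append("right_of")
    if cay < cby:          # image y grows downward
        rels.append("above")
    elif cay > cby:
        rels.append("below")
    if iou(a, b) > 0:
        rels.append("overlaps")
    return rels

# Hypothetical scene: a cup to the upper-left of a plate.
cup = (10, 40, 50, 80)
plate = (60, 50, 120, 90)
print(spatial_relations(cup, plate))  # -> ['left_of', 'above']
```

The point of the sketch is that such predicates are cheap to compute once boxes are accurate, which is why the benchmark can decouple localization accuracy from relational inference: a model may produce precise boxes yet fail to report these relations, or describe them fluently without geometric consistency.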