Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models remain fragile at low-level spatial intelligence in embodied 3D environments, particularly interaction-oriented perception. To address this gap, this work introduces Embodied3DBench, a robot-centric benchmark of six task categories in two core groups, spatial structural understanding and interaction-oriented perception, spanning 12 subcategories. The benchmark contains over 21k high-quality question-answer pairs and is complemented by a synthesized training set of 1.3M QA pairs. An evaluation of 13 state-of-the-art models shows relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, but pronounced weakness in interaction-oriented perception. Fine-tuning on the synthesized data yields significant improvements in low-level spatial intelligence, which supports the benchmark's value as both an evaluation framework and a scalable data source.

📝 Abstract

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.
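
The two-group taxonomy described in the abstract can be sketched as a small data structure. The group and task names below come from the abstract; the record format and the `group_accuracy` helper are hypothetical, and scoring each answer as simply correct or incorrect is an assumption made for illustration, not the metric stated by the authors.

```python
from collections import defaultdict

# Task taxonomy as listed in the abstract: two groups, three task categories each.
TAXONOMY = {
    "Spatial Structural Understanding": [
        "Grounding",
        "Spatial Relation Prediction",
        "Multi-view Correspondence",
    ],
    "Interaction-Oriented Perception": [
        "Affordance Prediction",
        "Grasp Point Prediction",
        "Trajectory Prediction",
    ],
}

# Reverse lookup: task category -> group.
TASK_TO_GROUP = {task: group for group, tasks in TAXONOMY.items() for task in tasks}


def group_accuracy(records):
    """Return the fraction of correct answers within each group.

    `records` is a hypothetical list of (task, is_correct) pairs; binary
    correctness is an assumption, not the benchmark's documented metric.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_seen]
    for task, is_correct in records:
        group = TASK_TO_GROUP[task]
        totals[group][0] += int(is_correct)
        totals[group][1] += 1
    return {group: correct / seen for group, (correct, seen) in totals.items()}


if __name__ == "__main__":
    # Toy records, invented purely to exercise the helper.
    toy = [
        ("Grounding", True),
        ("Spatial Relation Prediction", True),
        ("Grasp Point Prediction", False),
        ("Trajectory Prediction", True),
    ]
    print(group_accuracy(toy))
```

Reporting scores per group rather than as a single average keeps the contrast the abstract emphasizes, between spatial structural understanding and interaction-oriented perception, visible in the results.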

Problem

Research questions and friction points this paper is trying to address.

Embodied AI

Vision Language Models

3D Spatial Reasoning

Interaction-Oriented Perception

Benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied3DBench

low-level spatial intelligence

vision language models