RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited spatial reasoning and embodied interaction capabilities due to the absence of explicit spatial structure and reference-frame annotations in their training data. To address this, we introduce RoboSpatial, the first large-scale multimodal dataset explicitly designed for robotic spatial understanding, comprising one million 2D egocentric images, 5,000 3D indoor scans, and three million fine-grained spatial relation annotations, and supporting joint modeling across egocentric, object-centric, and world-centric reference frames. We present the first end-to-end co-training framework unifying 2D vision-language modeling with 3D spatial understanding, and propose a joint spatial-relation and affordance modeling architecture to enhance scene grounding. Our approach achieves an average accuracy improvement of 12.7% over state-of-the-art methods on spatial relation prediction, affordance recognition, and robotic manipulation tasks, demonstrating the critical role of reference-frame-aware representations in embodied intelligence.
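
A minimal sketch of what one such annotation record might look like, assuming a simple schema; the field names, enum values, and 2D/3D pairing fields below are illustrative, not the dataset's published format:

```python
from dataclasses import dataclass
from enum import Enum

class ReferenceFrame(Enum):
    """The three reference frames the dataset distinguishes."""
    EGOCENTRIC = "ego"         # relative to the camera/robot viewpoint
    OBJECT_CENTRIC = "object"  # relative to the reference object's own orientation
    WORLD_CENTRIC = "world"    # relative to a fixed scene coordinate system

@dataclass
class SpatialRelationAnnotation:
    """One of the ~3M relation annotations, paired across 2D and 3D media.

    Field names are illustrative, not the dataset's published schema.
    """
    image_path: str        # 2D egocentric RGB frame
    scan_id: str           # identifier of the paired 3D indoor scan
    subject: str           # e.g. "mug"
    reference: str         # e.g. "laptop"
    relation: str          # e.g. "left_of", "behind", "on_top_of"
    frame: ReferenceFrame  # frame in which the relation is asserted
```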

📝 Abstract
Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are handled by vision-language models, which face significant challenges when applied to spatial reasoning because of their training data sources: built from general-purpose image datasets, they often lack sophisticated spatial scene understanding. For example, these datasets do not address reference frame comprehension: a spatial relationship requires clear contextual grounding, whether from an ego-centric, object-centric, or world-centric perspective, to support effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and egocentric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans making it both 2D- and 3D-ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
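
To make the reference-frame point concrete, here is a minimal sketch of how the same geometric question ("is the subject left of the reference?") can flip its answer depending on the frame it is evaluated in. The function, the convention that +x points right, and the example poses are assumptions for illustration only, not code from the paper:

```python
import numpy as np

def left_of(subject_xyz, reference_xyz, frame_rotation):
    """Return True if the subject lies to the left of the reference,
    with both points expressed in the chosen frame.

    frame_rotation is a 3x3 world-to-frame rotation: the camera pose
    for an egocentric judgment, the reference object's orientation for
    an object-centric one, or the identity for a world-centric one.
    """
    # Express the subject's offset from the reference in the chosen frame.
    offset = frame_rotation @ (np.asarray(subject_xyz) - np.asarray(reference_xyz))
    # Assumed convention: +x points right in the frame, so "left" is -x.
    return offset[0] < 0

# The same scene yields opposite answers in different frames.
mug, laptop = [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]
egocentric = np.eye(3)                        # camera frame aligned with world
object_centric = np.diag([-1.0, -1.0, 1.0])   # laptop rotated 180° to face the camera
print(left_of(mug, laptop, egocentric))       # False: mug is to the viewer's right
print(left_of(mug, laptop, object_centric))   # True: mug is on the laptop's own left
```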
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in vision-language models for robotics
Addressing the lack of spatial understanding in general-purpose image datasets
Improving robot perception and interaction with 2D/3D spatial data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for spatial understanding
Combines 2D and 3D vision-language models
Rich annotated spatial relationships for robotics (see the sample sketch after this list)
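
As a rough illustration of how a paired annotation could drive 2D/3D co-training, the hypothetical sample below renders one spatial relation as a question-answer pair grounded in both an egocentric image and its paired scan. The paths, dictionary keys, and formatting helper are invented for this sketch; the paper does not specify its training format here:

```python
# Hypothetical sample: one annotated relation rendered as VQA-style
# supervision usable by a 2D VLM (from the image) and a 3D model
# (from the paired scan). All paths and keys are illustrative.
sample = {
    "image": "scenes/scene0042/frame_000153.jpg",  # 2D egocentric frame
    "scan": "scenes/scene0042/mesh.ply",           # paired 3D scan
    "question": "From the camera's viewpoint, is the mug to the left of the laptop?",
    "answer": "no",
    "frame": "egocentric",  # the reference frame the question assumes
}

def to_vqa_prompt(s: dict) -> str:
    """Format the sample as a plain question-answer prompt for co-training."""
    return f"Q: {s['question']}\nA: {s['answer']}"

print(to_vqa_prompt(sample))
```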