Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of large-scale vision-language paired data for 3D scene understanding, this paper proposes Ross3D, a reconstructive visual instruction tuning framework. The method requires no 3D language annotations and injects 3D geometric priors into large multimodal models through dual visual supervision: cross-view masked reconstruction and global bird's-eye-view (BEV) generation. By combining 3D-aware instruction tuning with semi-supervised learning, it enables efficient pretraining on purely visual 3D data. Ross3D achieves state-of-the-art performance on multiple 3D scene understanding benchmarks, and semi-supervised experiments show that only 10% of the labeled data suffices to approach fully supervised results, substantially improving spatial reasoning and cross-view generalization.

📝 Abstract
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
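To make the masked cross-view objective concrete, here is a minimal numpy sketch of a masked-patch reconstruction loss, where the decoder's prediction for a held-out view is scored only on the masked patches. All names, shapes, and the mask ratio are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed only over masked patches.

    pred, target: (num_patches, dim) patch features of the masked view
    mask: (num_patches,) boolean, True where the patch was masked out
    """
    diff = (pred - target) ** 2
    return diff[mask].mean()

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
# Hypothetical ground-truth patch features of the masked view.
target = rng.normal(size=(num_patches, dim))
# Stand-in for a decoder output that aggregates overlapping views;
# here just the target plus noise, so the loss is small but nonzero.
pred = target + 0.1 * rng.normal(size=(num_patches, dim))
# Mask roughly 75% of the patches.
mask = rng.random(num_patches) < 0.75

loss = masked_reconstruction_loss(pred, target, mask)
```

The global-view objective would use the same loss shape, with the target being BEV image patches instead of a masked camera view.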
Problem

Research questions and friction points this paper is trying to address.

Adapting 2D LMMs for 3D scene interpretation
Addressing lack of large-scale 3D vision-language datasets
Integrating 3D-aware visual supervision via reconstructive training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 3D-aware visual supervision
Uses cross-view and global-view reconstruction
Leverages unlabeled 3D vision-only data