LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

šŸ“… 2024-09-26
šŸ›ļø arXiv.org
šŸ“ˆ Citations: 3
✨ Influential: 0
šŸ¤– AI Summary
To address the prevalent lack of 3D scene understanding in large multimodal models (LMMs), this paper proposes LLaVA-3D—a framework enabling native 3D perception without compromising existing 2D vision-language capabilities. Methodologically, it integrates 3D positional embeddings into CLIP image patches to establish a unified 2D/3D visual representation and introduces a joint 2D/3D instruction-tuning paradigm to support dual-modality comprehension within a single architecture. Built upon the LLaVA backbone, LLaVA-3D incorporates 3D position encoding, CLIP-based visual features, and multimodal alignment mechanisms. Experiments demonstrate a 3.5Ɨ acceleration in training convergence, state-of-the-art performance on 3D understanding benchmarks, and retention of LLaVA-level accuracy on 2D image comprehension and multi-turn visual dialogue tasks.

šŸ“ Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize the 3D position embeddings to bring the 2D CLIP patches within a 3D spatial context. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.
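The abstract's core mechanism — bringing 2D CLIP patches into a 3D spatial context by adding 3D position embeddings — can be sketched roughly as below. This is a minimal illustration, not the paper's exact formulation: the sinusoidal encoding, the even per-axis channel split, and the function names are assumptions made for the sketch.

```python
import numpy as np

def sinusoidal_embed_3d(coords, dim):
    """Map per-patch 3D coordinates [N, 3] to a sinusoidal embedding [N, dim].
    Hypothetical scheme: each of the 3 axes gets roughly dim // 3 channels,
    half sine and half cosine, over a geometric frequency ladder."""
    per_axis = dim // 3
    n_freqs = per_axis // 2
    freqs = 1.0 / (10000.0 ** (np.arange(n_freqs) / n_freqs))  # [n_freqs]
    parts = []
    for axis in range(3):
        angles = coords[:, axis:axis + 1] * freqs  # broadcast -> [N, n_freqs]
        parts.append(np.sin(angles))
        parts.append(np.cos(angles))
    emb = np.concatenate(parts, axis=1)  # [N, 3 * 2 * n_freqs]
    # Zero-pad to exactly `dim` channels if the split did not divide evenly.
    pad = dim - emb.shape[1]
    return np.pad(emb, ((0, 0), (0, pad)))

def add_3d_position(patch_features, patch_xyz):
    """Inject 3D spatial context into 2D patch features by simple addition,
    preserving the feature shape so the 2D LMM pipeline is unchanged."""
    n, dim = patch_features.shape
    return patch_features + sinusoidal_embed_3d(patch_xyz, dim)
```

Because the output keeps the same shape as the 2D patch features, the same LLM backbone can consume 2D patches (no embedding added) and 3D-lifted patches interchangeably, which is the property the unified joint 2D/3D instruction tuning relies on.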
Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models
3D Scene Understanding
Data Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLaVA-3D
Multimodal 2D-3D Understanding
Unified Model