Unifying 2D and 3D Vision-Language Understanding

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Progress in 3D vision-language understanding is hindered by the scarcity of large-scale annotated 3D data and a substantial modality gap compared to 2D vision-language tasks. Method: We propose UniVLG, a unified architecture featuring a Transformer-based multimodal encoder, a novel language-conditioned shared masked decoder, and geometry-aligned 2D-to-3D feature enhancement—eliminating reliance on 3D mesh reconstruction or ground-truth proposals. It employs end-to-end joint 2D/3D contrastive learning and grounding loss for co-training. Contribution/Results: UniVLG achieves state-of-the-art performance across multiple 3D vision-language localization benchmarks while maintaining or even improving 2D performance. It supports unified inference on both RGB and RGB-D inputs, enabling end-to-end evaluation in realistic embodied AI scenarios. This work bridges the 2D–3D modality gap without domain-specific supervision and establishes a new paradigm for holistic 2D/3D vision-language understanding.

📝 Abstract
Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at [univlg.github.io](https://univlg.github.io).
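The "2D-to-3D lifting" the abstract mentions rests on standard pinhole-camera geometry: each RGB-D pixel is unprojected into a 3D point using the camera intrinsics, so 2D features can be associated with 3D positions. The paper does not publish this exact routine; the sketch below is a minimal illustration of the underlying unprojection step, with the function name and interface chosen for clarity, not taken from the UniVLG codebase.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map (H, W) into a per-pixel 3D point map (H, W, 3)
    in camera coordinates, using pinhole intrinsics K (3x3).
    Standard geometry; names and shapes are illustrative assumptions."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinate grid
    fx, fy = K[0, 0], K[1, 1]                       # focal lengths
    cx, cy = K[0, 2], K[1, 2]                       # principal point
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Example: a flat scene 2 m in front of a 640x480 camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
pts = unproject_depth(np.full((480, 640), 2.0), K)
```

Once every pixel carries a 3D coordinate, 2D backbone features computed on the RGB image can be scattered onto those points, which is what lets abundant 2D supervision benefit 3D grounding.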
Problem

Research questions and friction points this paper is trying to address.

Large-scale annotated 3D vision-language data is scarce, limiting progress relative to the 2D domain.
A substantial modality gap separates 2D-centric pre-trained models from the 3D sensory data available in embodied systems.
Prior 3D grounding pipelines depend on 3D mesh reconstruction and ground-truth object proposals, preventing realistic, embodied-aligned evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified architecture for 2D and 3D vision-language understanding
Language-conditioned mask decoder for RGB and RGB-D images
2D-to-3D lifting strategies to enhance 3D performance
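The "language-conditioned mask decoder" listed above can be pictured as object queries that first absorb information from the language tokens, then score every pixel (or lifted 3D point) to produce soft segmentation masks. The sketch below is a simplified single decoding step under assumed shapes and names; the actual UniVLG decoder is a full Transformer and is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_decode(queries, lang, feats):
    """One toy decoding step (illustrative, not the paper's implementation):
    queries: (Q, D) learnable object queries
    lang:    (T, D) language token embeddings
    feats:   (N, D) flattened per-pixel or per-point features
    Returns (Q, N) soft masks in [0, 1], one per query."""
    D = queries.shape[1]
    # Condition queries on the referring expression via cross-attention.
    attn = softmax(queries @ lang.T / np.sqrt(D))   # (Q, T)
    cond = queries + attn @ lang                    # language-conditioned queries
    # Dot each conditioned query against the feature map to score locations.
    masks = 1.0 / (1.0 + np.exp(-(cond @ feats.T))) # (Q, N) sigmoid masks
    return masks
```

Predicting masks rather than boxes is what lets the same decoder head serve RGB pixels and RGB-D points interchangeably, since a mask is just a score per location in either representation.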