DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

📅 2025-05-08
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing methods for ego-centric (first-person) 3D visual grounding in embodied intelligence suffer from the loss of fine-grained visual semantics caused by sparse fusion of point clouds with multi-view images, and from limited grounding accuracy stemming from context-poor natural language descriptions. Method: a Hierarchical Scene Semantic Enhancer preserves dense cross-modal visual structure, while a Language Semantic Enhancer leverages large language models (LLMs) to generate diverse, context-rich training texts. The approach combines multimodal feature alignment, hierarchical point cloud–image fusion, LLM-driven textual augmentation, and cross-modal contrastive learning. Results: on standard benchmarks the method sets a new state of the art, improving test accuracy by 5.81% on the full dataset and 7.56% on the mini subset. It won both first place and the Innovation Award in the Multi-View 3D Visual Grounding Track of the CVPR 2024 Autonomous Grand Challenge.
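The summary names cross-modal contrastive learning as one ingredient of the approach. Below is a minimal sketch of what such an objective commonly looks like: a symmetric InfoNCE loss between paired point-cloud and image embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of a cross-modal contrastive alignment objective
# (symmetric InfoNCE). All names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(point_feats: torch.Tensor,
                               image_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired point/image embeddings."""
    # L2-normalize so dot products are cosine similarities.
    p = F.normalize(point_feats, dim=-1)   # (B, D)
    v = F.normalize(image_feats, dim=-1)   # (B, D)
    logits = p @ v.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_p2v = F.cross_entropy(logits, targets)
    loss_v2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2v + loss_v2p)
```

In this form, each point-cloud embedding is pulled toward its paired image embedding and pushed away from all other images in the batch, and vice versa.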

📝 Abstract
Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, and (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.
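The abstract describes the Language Semantic Enhancer as using LLMs to produce diverse, context-rich descriptions during training. A hedged sketch of that style of augmentation is below; `llm_complete` is a hypothetical stand-in for any completion client, and the prompt wording is an assumption rather than the paper's actual prompt.

```python
# Illustrative sketch of LLM-driven description augmentation.
# `llm_complete` is a hypothetical completion function supplied by the
# caller; the prompt template is an assumption, not the paper's prompt.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Rewrite the following 3D object description into {n} diverse, "
    "context-rich sentences that keep the same target object and "
    "spatial relations:\n\n\"{description}\""
)

def augment_description(description: str,
                        llm_complete: Callable[[str], str],
                        n_variants: int = 3) -> List[str]:
    """Return up to n_variants enriched paraphrases of a description."""
    prompt = PROMPT_TEMPLATE.format(n=n_variants, description=description)
    response = llm_complete(prompt)
    # Assume one paraphrase per line; strip bullets and drop empties.
    variants = [line.strip("- ").strip()
                for line in response.splitlines() if line.strip()]
    return variants[:n_variants]
```

The enriched variants would then be mixed into training alongside the original queries, giving the grounding model more varied and context-rich language supervision.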
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D visual grounding via dense language-vision semantics
Addressing sparse fusion loss in point cloud-image alignment
Improving textual context with diverse language descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Scene Semantic Enhancer for dense visual features (see the sketch after this list)
Language Semantic Enhancer using large language models
Cross-modal alignment for improved 3D visual grounding
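The Hierarchical Scene Semantic Enhancer (first item above) is described as preserving dense semantics when fusing point-cloud features with multi-view image features. The sketch below shows one plausible form of such dense fusion: sampling multi-scale image features at each point's projected pixel location and mixing in a learned scene-level token. Module structure, shapes, and names are assumptions; the paper's actual design may differ substantially.

```python
# Hedged sketch of dense hierarchical point/image fusion. Shapes, names,
# and the scene-token design are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusion(nn.Module):
    def __init__(self, point_dim: int, img_dim: int, num_levels: int):
        super().__init__()
        # Project concatenated per-point + multi-level image features
        # back to the point feature width.
        self.proj = nn.Linear(point_dim + num_levels * img_dim, point_dim)
        # Learned global token so every point sees scene-level context.
        self.scene_token = nn.Parameter(torch.zeros(1, 1, point_dim))

    def forward(self, point_feats, img_pyramids, uv):
        # point_feats: (B, N, Cp); uv: (B, N, 2), normalized to [-1, 1]
        # img_pyramids: list of (B, Ci, H_l, W_l) multi-scale feature maps
        grid = uv.unsqueeze(2)  # (B, N, 1, 2) sampling grid
        sampled = []
        for fmap in img_pyramids:
            # Bilinearly sample image features at each projected point.
            s = F.grid_sample(fmap, grid, align_corners=False)  # (B, Ci, N, 1)
            sampled.append(s.squeeze(-1).transpose(1, 2))       # (B, N, Ci)
        fused = self.proj(torch.cat([point_feats, *sampled], dim=-1))
        return fused + self.scene_token
```

Sampling image features per point at every pyramid level, rather than pooling them into a few sparse anchors, is one way to retain the fine-grained visual semantics that the paper says sparse fusion loses.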
Authors

Henry Zheng
Department of Automation, BNRist, Tsinghua University

Hao Shi
Department of Automation, BNRist, Tsinghua University

Qihang Peng
Department of Automation, BNRist, Tsinghua University

Yong Xien Chng
Department of Automation, BNRist, Tsinghua University

Rui Huang
Department of Automation, BNRist, Tsinghua University

Yepeng Weng
Researcher, Lenovo Research
Large Language Models · Computer Vision

Zhongchao Shi
AI Lab, Lenovo Research

Gao Huang
Department of Automation, BNRist, Tsinghua University