DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

📅 2025-05-08
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing methods for ego-centric (first-person) 3D visual grounding in embodied intelligence suffer from the loss of fine-grained visual semantics caused by sparse fusion of point clouds with multi-view images, and from limited grounding accuracy stemming from context-poor natural language descriptions. Method: a Hierarchical Scene Semantic Enhancer preserves dense cross-modal visual structure, while a Language Semantic Enhancer leverages large language models (LLMs) to generate diverse, context-rich training texts. The approach combines multimodal feature alignment, hierarchical point cloud–image fusion, LLM-driven textual augmentation, and cross-modal contrastive learning. Results: on standard benchmarks the method sets a new state of the art, improving test accuracy by 5.81% on the full dataset and 7.56% on the mini subset. It won both first place and the Innovation Award in the Multi-View 3D Visual Grounding Track of the CVPR 2024 Autonomous Grand Challenge.
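The summary names cross-modal contrastive learning as one ingredient of the approach. Below is a minimal sketch of what such an objective commonly looks like: a symmetric InfoNCE loss between paired point-cloud and image embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of a cross-modal contrastive alignment objective
# (symmetric InfoNCE). All names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(point_feats: torch.Tensor,
                               image_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired point/image embeddings."""
    # L2-normalize so dot products are cosine similarities.
    p = F.normalize(point_feats, dim=-1)   # (B, D)
    v = F.normalize(image_feats, dim=-1)   # (B, D)
    logits = p @ v.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_p2v = F.cross_entropy(logits, targets)
    loss_v2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2v + loss_v2p)
```

In this form, each point-cloud embedding is pulled toward its paired image embedding and pushed away from all other images in the batch, and vice versa.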

📝 Abstract
Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, and (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.
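The abstract describes the Language Semantic Enhancer as using LLMs to produce diverse, context-rich descriptions during training. A hedged sketch of that style of augmentation is below; `llm_complete` is a hypothetical stand-in for any completion client, and the prompt wording is an assumption rather than the paper's actual prompt.

```python
# Illustrative sketch of LLM-driven description augmentation.
# `llm_complete` is a hypothetical completion function supplied by the
# caller; the prompt template is an assumption, not the paper's prompt.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Rewrite the following 3D object description into {n} diverse, "
    "context-rich sentences that keep the same target object and "
    "spatial relations:\n\n\"{description}\""
)

def augment_description(description: str,
                        llm_complete: Callable[[str], str],
                        n_variants: int = 3) -> List[str]:
    """Return up to n_variants enriched paraphrases of a description."""
    prompt = PROMPT_TEMPLATE.format(n=n_variants, description=description)
    response = llm_complete(prompt)
    # Assume one paraphrase per line; strip bullets and drop empties.
    variants = [line.strip("- ").strip()
                for line in response.splitlines() if line.strip()]
    return variants[:n_variants]
```

The enriched variants would then be mixed into training alongside the original queries, giving the grounding model more varied and context-rich language supervision.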
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D visual grounding via dense language-vision semantics
Addressing sparse fusion loss in point cloud-image alignment
Improving textual context with diverse language descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Scene Semantic Enhancer for dense visual features (see the sketch after this list)
Language Semantic Enhancer using large language models
Cross-modal alignment for improved 3D visual grounding
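The Hierarchical Scene Semantic Enhancer (first item above) is described as preserving dense semantics when fusing point-cloud features with multi-view image features. The sketch below shows one plausible form of such dense fusion: sampling multi-scale image features at each point's projected pixel location and mixing in a learned scene-level token. Module structure, shapes, and names are assumptions; the paper's actual design may differ substantially.

```python
# Hedged sketch of dense hierarchical point/image fusion. Shapes, names,
# and the scene-token design are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusion(nn.Module):
    def __init__(self, point_dim: int, img_dim: int, num_levels: int):
        super().__init__()
        # Project concatenated per-point + multi-level image features
        # back to the point feature width.
        self.proj = nn.Linear(point_dim + num_levels * img_dim, point_dim)
        # Learned global token so every point sees scene-level context.
        self.scene_token = nn.Parameter(torch.zeros(1, 1, point_dim))

    def forward(self, point_feats, img_pyramids, uv):
        # point_feats: (B, N, Cp); uv: (B, N, 2), normalized to [-1, 1]
        # img_pyramids: list of (B, Ci, H_l, W_l) multi-scale feature maps
        grid = uv.unsqueeze(2)  # (B, N, 1, 2) sampling grid
        sampled = []
        for fmap in img_pyramids:
            # Bilinearly sample image features at each projected point.
            s = F.grid_sample(fmap, grid, align_corners=False)  # (B, Ci, N, 1)
            sampled.append(s.squeeze(-1).transpose(1, 2))       # (B, N, Ci)
        fused = self.proj(torch.cat([point_feats, *sampled], dim=-1))
        return fused + self.scene_token
```

Sampling image features per point at every pyramid level, rather than pooling them into a few sparse anchors, is one way to retain the fine-grained visual semantics that the paper says sparse fusion loses.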
Authors

Henry Zheng
Department of Automation, BNRist, Tsinghua University

Hao Shi
Department of Automation, BNRist, Tsinghua University

Qihang Peng
Department of Automation, BNRist, Tsinghua University

Yong Xien Chng
Department of Automation, BNRist, Tsinghua University

Rui Huang
Department of Automation, BNRist, Tsinghua University

Yepeng Weng
Researcher, Lenovo Research
Large Language Models · Computer Vision

Zhongchao Shi
AI Lab, Lenovo Research

Gao Huang
Department of Automation, BNRist, Tsinghua University