Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

📅 2025-06-05
🤖 AI Summary
This work addresses the 3D referring localization task in embodied intelligence, identifying a critical limitation of prevailing two-stage detection-driven paradigms: generic 3D detectors—without language instruction fine-tuning—already outperform dedicated referring models on category-level referring localization, revealing insufficient modeling of fine-grained semantic alignment. To address this, we propose DEGround, the first framework to unify detection and referring queries within a DETR architecture. It introduces a learnable region activation mechanism and a sentence-level semantic embedding–driven query modulation module, enabling joint optimization of detection and referring. Evaluated on the EmbodiedScan validation set, DEGround achieves a 7.52% absolute improvement in overall accuracy over the state-of-the-art BIP3D, demonstrating significantly enhanced contextual awareness and instruction comprehension capabilities.

📝 Abstract
Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods follow a two-stage paradigm in which a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: does embodied 3D grounding benefit enough from detection? To answer it, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models, without any instruction-specific training, outperform grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding is not well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as the object representation for both DEtection and Grounding, enabling the grounding task to benefit from basic category classification and box detection. On top of this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
Problem

Research questions and friction points this paper is trying to address.

Asks whether detection models alone suffice for embodied 3D grounding.
Proposes DEGround to enhance context-aware object localization.
Improves grounding accuracy by jointly integrating language and detection.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shares DETR queries for detection and grounding
Introduces regional activation grounding module
Modulates queries with sentence-level semantics
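The two grounding modules above can be illustrated with a minimal sketch. This is a hypothetical simplification (NumPy, random weights), not the paper's implementation: regional activation weights each shared DETR query by its similarity to word embeddings, and query-wise modulation applies a FiLM-style scale and shift derived from the sentence-level embedding. All function and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def regional_activation(queries, word_embs):
    """Highlight instruction-related queries: each query is reweighted by
    its strongest similarity to any word in the instruction (sketch)."""
    sim = queries @ word_embs.T            # (num_queries, num_words)
    act = sim.max(axis=1)                  # best word match per query
    w = np.exp(act - act.max())            # softmax over queries
    w /= w.sum()
    return queries * w[:, None] * len(w)   # rescale so mean weight is 1

def query_modulation(queries, sent_emb, W_gamma, W_beta):
    """FiLM-style query-wise modulation: the sentence embedding produces a
    per-channel scale (gamma) and shift (beta) applied to every query."""
    gamma = np.tanh(sent_emb @ W_gamma)    # (d,)
    beta = np.tanh(sent_emb @ W_beta)      # (d,)
    return queries * (1.0 + gamma) + beta  # broadcast over queries

d = 8
queries = rng.standard_normal((16, d))     # shared detection/grounding queries
sent_emb = rng.standard_normal(d)          # sentence-level embedding
word_embs = rng.standard_normal((5, d))    # word-level embeddings
W_gamma = rng.standard_normal((d, d)) * 0.1
W_beta = rng.standard_normal((d, d)) * 0.1

out = query_modulation(regional_activation(queries, word_embs),
                       sent_emb, W_gamma, W_beta)
print(out.shape)  # (16, 8)
```

Because the queries are shared with the detector, both modules operate on representations already optimized for category classification and box regression, which is the joint-optimization idea the summary describes.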
Yani Zhang
School of Computer Science, Wuhan University
Dongming Wu
MMLab, CUHK; CPII
Computer Vision · Vision and Language · MLLM · Embodied AI
Hao Shi
Tsinghua University
Yingfei Liu
Megvii Technology
Tiancai Wang
Dexmal
Computer Vision · Embodied AI
Haoqiang Fan
Megvii
Computer Vision
Xingping Dong
School of Computer Science, Wuhan University