Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

๐Ÿ“… 2025-04-13
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses the novel 3D Embodied Referential Understanding (3D-ERU) taskโ€”localizing target objects in 3D scenes via joint grounding of natural language descriptions and human pointing gestures. Methodologically, we (1) introduce ImputeRefer, the first benchmark dataset for 3D-ERU, and Imputer, a data augmentation framework for gesture-imputed referential samples; and (2) propose Ges3ViG, an end-to-end multimodal model that fuses point clouds, linguistic input, and gesture cues via cross-modal attention, multimodal alignment, and a diffusion-based pointing-pose generation mechanism. Our contributions include: establishing 3D-ERU as a formal task paradigm; open-sourcing the first embodied, gesture-aware 3D multimodal benchmark and implementation; and achieving ~30% absolute accuracy improvement on 3D-ERU over prior baselines, with a +9% gain relative to language-only 3D grounding models.

Technology Category

Application Category

๐Ÿ“ Abstract
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework-Imputer, and use it to curate a new benchmark dataset-ImputeRefer for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.
Problem

Research questions and friction points this paper is trying to address.

Combining language and pointing gestures for 3D object identification
Addressing lack of datasets for 3D embodied reference understanding
Improving accuracy in 3D visual grounding with gesture integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates pointing gestures into 3D visual grounding
Uses data augmentation framework Imputer for dataset creation
Achieves 30% accuracy improvement over existing models
๐Ÿ”Ž Similar Papers
No similar papers found.