AI Summary
This paper addresses the novel 3D Embodied Reference Understanding (3D-ERU) task: localizing target objects in 3D scenes by jointly grounding natural language descriptions and human pointing gestures. Methodologically, we (1) introduce ImputeRefer, the first benchmark dataset for 3D-ERU, together with Imputer, a data augmentation framework that produces gesture-imputed referential samples; and (2) propose Ges3ViG, an end-to-end multimodal model that fuses point clouds, linguistic input, and gesture cues via cross-modal attention, multimodal alignment, and a diffusion-based pointing-pose generation mechanism. Our contributions include: establishing 3D-ERU as a formal task paradigm; open-sourcing the first embodied, gesture-aware 3D multimodal benchmark and implementation; and achieving a ~30% absolute accuracy improvement on 3D-ERU over prior baselines, with a ~9% gain relative to language-only 3D grounding models.
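The summary above says Ges3ViG fuses point clouds, language, and gesture cues via cross-modal attention. As a purely illustrative sketch (the function, its feature shapes, and the residual-fusion design are assumptions for exposition, not the paper's actual architecture), such a fusion step can be written as object-proposal queries attending over concatenated language and gesture tokens:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(obj_feats, lang_feats, gesture_feats):
    """Fuse language and gesture cues into 3D object-proposal features.

    obj_feats:     (N, d) point-cloud object-proposal embeddings (queries)
    lang_feats:    (L, d) language-token embeddings
    gesture_feats: (G, d) gesture-cue embeddings
    Returns (N, d) language- and gesture-conditioned object features.
    """
    d = obj_feats.shape[1]
    # Keys/values: language and gesture tokens in one context sequence.
    context = np.concatenate([lang_feats, gesture_feats], axis=0)   # (L+G, d)
    scores = obj_feats @ context.T / np.sqrt(d)                     # (N, L+G)
    attn = softmax(scores, axis=-1)
    # Residual fusion: each proposal absorbs its attended context.
    return obj_feats + attn @ context                               # (N, d)

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(8, 16)),
                              rng.normal(size=(5, 16)),
                              rng.normal(size=(3, 16)))
print(fused.shape)  # (8, 16)
```

In this toy setup, each of the 8 object proposals ends up conditioned on all 5 language tokens and 3 gesture tokens; a downstream grounding head would then score the fused proposals against the referred target.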
Abstract
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU by incorporating human pointing gestures into existing 3D scene datasets that contain only language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves a ~30% improvement in accuracy over other 3D-ERU models and ~9% over purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.