Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing visual grounding methods, which predominantly rely on textual descriptions and struggle with linguistic ambiguity while neglecting non-linguistic cues such as pointing gestures. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale egocentric dataset for referential visual grounding that incorporates both hand-pointing annotations and dense semantic captions. We further propose SV-CoT, a novel framework that jointly models hand gestures and language by reframing the grounding task as a structured visual chain-of-thought reasoning process. Built upon a multimodal large language model and leveraging paired hand–target bounding box annotations, SV-CoT achieves an absolute accuracy improvement of 11.7% over state-of-the-art methods on benchmark evaluations, substantially enhancing agents’ capacity to interpret multimodal physical intent.
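The paper's code has not been released, so the following is only a minimal sketch of what one deictic grounding step in the spirit of SV-CoT might look like: localize the pointing hand, reason along the pointing ray, then fuse that geometric cue with a per-candidate language-match score. Every name here (`Box`, `ray_score`, `ground_deictic`) is a hypothetical illustration, not the authors' implementation.

```python
# Hypothetical sketch of a deictic grounding step in the spirit of SV-CoT.
# The paper's actual architecture and prompts are not public; every name
# below is an assumption used for illustration only.
import math
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

def ray_score(hand: Box, fingertip: tuple[float, float], candidate: Box) -> float:
    """Cosine alignment between the pointing ray (hand center -> fingertip)
    and the direction from the hand to a candidate object."""
    hx, hy = hand.center()
    cx, cy = candidate.center()
    ray = (fingertip[0] - hx, fingertip[1] - hy)
    to_cand = (cx - hx, cy - hy)
    norm = math.hypot(*ray) * math.hypot(*to_cand)
    return (ray[0] * to_cand[0] + ray[1] * to_cand[1]) / norm if norm else -1.0

def ground_deictic(hand: Box, fingertip: tuple[float, float],
                   candidates: list[tuple[Box, float]]) -> Box:
    """Structured inference: (1) hand box, (2) pointing ray, (3) fuse the
    geometric alignment with a per-candidate language-match score."""
    box, _ = max(candidates, key=lambda c: ray_score(hand, fingertip, c[0]) + c[1])
    return box
```

For example, `ground_deictic(hand, fingertip, [(mug_box, 0.8), (kettle_box, 0.3)])` would select whichever candidate best aligns with the pointing ray after weighting by its language score, mirroring the staged gesture-then-language reasoning the summary describes.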
📝 Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
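Since the dataset itself is not yet available, the record below is purely illustrative: the field names and values are assumptions inferred from the abstract's "hand-target bounding box pairs and dense semantic captions", not the actual EgoPoint-Ground schema.

```python
# Illustrative (assumed) layout of one EgoPoint-Ground sample; the real
# annotation schema is not published on this page.
sample = {
    "image": "frames/kitchen_0142.jpg",           # egocentric RGB frame (hypothetical path)
    "hand_box": [412, 518, 596, 700],             # pointing hand, [x1, y1, x2, y2] in pixels
    "target_box": [820, 305, 934, 410],           # referred object, paired with hand_box
    "expression": "that mug next to the kettle",  # referring expression
    "dense_caption": (
        "A right hand points across the counter toward a blue mug "
        "standing beside a steel kettle."
    ),
}
```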
Problem

Research questions and friction points this paper is trying to address.

Visual Grounding
Referring Expressions
Egocentric Vision
Hand Pointing
Multimodal Deixis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric Vision
Hand Pointing
Visual Grounding
Multimodal Reasoning
Visual Chain-of-Thought
Ling Li
Tsinghua University, Beijing, China
Bowen Liu
Andreessen Horowitz, insitro, Stanford
Zinuo Zhan
Northwestern Polytechnical University, Xi’an, China
Peng Jie
Northwestern Polytechnical University, Xi’an, China
Jianhui Zhong
Dalian University of Technology, Dalian, China
Kenglun Chang
Apple, USA
Zhidong Deng
Professor of Computer Science, Tsinghua University