🤖 AI Summary
This work addresses the limitations of existing visual grounding methods, which predominantly rely on textual descriptions and struggle with linguistic ambiguity while neglecting non-linguistic cues such as pointing gestures. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale egocentric dataset for referential visual grounding that incorporates both hand-pointing annotations and dense semantic captions. We further propose SV-CoT, a novel framework that jointly models hand gestures and language by reframing the grounding task as a structured visual chain-of-thought reasoning process. Built upon a multimodal large language model and leveraging paired hand–target bounding box annotations, SV-CoT achieves an absolute accuracy improvement of 11.7% over state-of-the-art methods on benchmark evaluations, substantially enhancing agents’ capacity to interpret multimodal physical intent.
📝 Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over **15k** interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand–target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an **11.7%** absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intent. The dataset and code will be made publicly available.