🤖 AI Summary
This work addresses the limitations of existing visual grounding methods, which predominantly rely on textual descriptions and struggle with linguistic ambiguity while neglecting non-linguistic cues such as pointing gestures. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale egocentric dataset for referential visual grounding that incorporates both hand-pointing annotations and dense semantic captions. We further propose SV-CoT, a novel framework that jointly models hand gestures and language by reframing the grounding task as a structured visual chain-of-thought reasoning process. Built upon a multimodal large language model and leveraging paired hand–target bounding box annotations, SV-CoT achieves an absolute accuracy improvement of 11.7% over state-of-the-art methods on benchmark evaluations, substantially enhancing agents’ capacity to interpret multimodal physical intent.
📝 Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over **15k** interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand–target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an **11.7%** absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intent. The dataset and code will be made publicly available.