Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Many embodied multimodal benchmarks admit linguistic shortcuts on referential language, neglecting the fine-grained spatiotemporal alignment between speech and co-speech gestures, and therefore fail to genuinely assess coreference understanding. To address this, the paper proposes the Egocentric Co-Speech Grounding (EcoG) task, which requires models to jointly predict the “What,” “Where,” and “When” of the referred object. The authors introduce EcoG-Bench, the first egocentric benchmark for this task, comprising 811 bilingual (Chinese–English) video clips annotated with millisecond-level gesture-stroke labels, dense spatial bounding boxes, precise timestamps, and word-level ASR transcripts. Experiments show that humans achieve a strict accuracy of 96.9% on this benchmark, while the best current model reaches only 17.0%; supplying timestamp-aligned frames and verified ASR raises the same model to 42.9%, highlighting the critical role of interface design in making temporal cues observable.
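
The strict accuracy reported above counts a clip as correct only when the object identity, the spatial box, and the stroke timing are all right at once. The snippet below is a minimal sketch of such a joint check, assuming a 0.5 IoU threshold, a ±0.2 s timing tolerance, and the field names shown; these are illustrative assumptions, not the paper's official scorer.

```python
# Minimal sketch of a strict "Eco-Accuracy"-style joint check (not the paper's
# official scorer). Field names, the 0.5 IoU threshold, and the 0.2 s temporal
# tolerance are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EcoGPrediction:
    label: str     # "What": referred object category
    box: tuple     # "Where": (x1, y1, x2, y2) in pixels
    stroke: tuple  # "When": (start_s, end_s) of the pointing stroke

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def strict_correct(pred, gold, iou_thr=0.5, time_tol_s=0.2):
    """A prediction only counts if What, Where, and When are all correct."""
    what_ok = pred.label.strip().lower() == gold.label.strip().lower()
    where_ok = iou(pred.box, gold.box) >= iou_thr
    when_ok = (abs(pred.stroke[0] - gold.stroke[0]) <= time_tol_s and
               abs(pred.stroke[1] - gold.stroke[1]) <= time_tol_s)
    return what_ok and where_ok and when_ok

def eco_accuracy(preds, golds):
    """Strict accuracy: fraction of clips where the joint prediction is correct."""
    hits = sum(strict_correct(p, g) for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0
```

A prediction that names the right object but misses the stroke window by a second would score zero under this kind of metric, which is what makes the human-model gap so stark.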

📝 Abstract
In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., “pass me that”), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision, organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video-audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% → 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech-gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
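
The diagnostic ablation in the abstract swaps the native video-audio interface for timestamped frame samples plus a verified word-timed transcript. The sketch below shows, under assumed message and field formats (not the paper's actual pipeline), how such an input could be packaged so that temporal cues become explicit in the prompt.

```python
# Illustrative sketch of the interface ablation described above: instead of raw
# video+audio, hand the model timestamped frame samples plus an externally
# verified ASR transcript with word-level timing. The sampling scheme, message
# format, and field names are assumptions, not the paper's exact setup.
def build_timestamped_prompt(frames, asr_words, question):
    """
    frames:    list of (timestamp_s, image) pairs sampled from the clip
    asr_words: list of (word, start_s, end_s) tuples from a verified ASR pass
    question:  the EcoG query asking What/Where/When for the deictic command
    """
    content = []
    for t, image in frames:
        # Prefix each frame with its timestamp so the model can bind frames
        # to moments in the transcript.
        content.append({"type": "text", "text": f"[frame @ {t:.2f}s]"})
        content.append({"type": "image", "image": image})

    # Flatten the word-level timing into the text so stroke-word alignment
    # is directly observable.
    transcript = " ".join(
        f"{word}({start:.2f}-{end:.2f}s)" for word, start, end in asr_words
    )
    content.append({"type": "text",
                    "text": f"Word-timed transcript: {transcript}\n{question}"})
    return [{"role": "user", "content": content}]
```

The point of such an interface is simply to make the “When” evidence observable in the input; the reported 17.0% → 42.9% jump suggests much of the failure sits at this interface level rather than in reasoning alone.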
Problem

Research questions and friction points this paper is trying to address.

deictic interaction
audio-visual alignment
egocentric grounding
co-speech gesture
multimodal benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric Co-Speech Grounding
audio-visual alignment
deictic interaction
multimodal benchmark
temporal grounding
Weijie Zhou
Beijing Jiaotong University
Xuantang Xiong
Tencent Robotics X
Zhenlin Hu
Harbin Institute of Technology, Shenzhen
Xiaomeng Zhu
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST)
Chaoyang Zhao
Institute of Automation, Chinese Academy of Sciences
computer vision
Honghui Dong
Beijing Jiaotong University
Zhengyou Zhang
Tencent AI Lab & Tencent Robotics X
Computer Vision, Multimedia, Speech, Robotics, AI
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (CASIA)
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (CASIA)