🤖 AI Summary
Current embodied multimodal large language models (MLLMs) often exploit linguistic shortcuts when processing referential language, and existing benchmarks neglect the fine-grained spatiotemporal alignment between speech and pointing gestures, so they fail to genuinely assess coreference understanding. To address this, the authors propose the Egocentric Co-Speech Grounding (EcoG) task, which requires models to jointly predict the "What," "Where," and "When" of a referred object. They introduce EcoG-Bench, the first egocentric benchmark for this task, comprising 811 bilingual (English–Chinese) video clips annotated with millisecond-level gesture-stroke labels, dense spatial bounding boxes, precise timestamps, and word-level ASR transcripts. Humans achieve 96.9% strict accuracy on this benchmark, while the best current model reaches only 17.0%; supplying timestamp-aligned frames and accurate ASR improves performance to 42.9%, highlighting the critical role of interface design in making temporal alignment cues observable to models.
📝 Abstract
In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me *that*"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing *stroke*. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the *audio–visual alignment* required by deictic interaction. To bridge this gap, we introduce **Egocentric Co-Speech Grounding (EcoG)**, where grounding is executable only if an agent jointly predicts *What*, *Where*, and *When*. To operationalize this, we present **EcoG-Bench**, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of **811** egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a **Progressive Cognitive Evaluation** protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (**96.9%** strict Eco-Accuracy), the best native video–audio setting remains low (Gemini-3-Pro: **17.0%**). Moreover, in a diagnostic ablation, replacing the native video–audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (**17.0%** → **42.9%**). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech–gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
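The strict Eco-Accuracy above counts a clip as correct only when the *What*, *Where*, and *When* predictions are all simultaneously right, which is why scores collapse relative to per-dimension metrics. Here is a minimal sketch of such a joint metric; the 0.5 box-IoU and 0.5 temporal-IoU thresholds, the field names, and the exact matching rules are illustrative assumptions, not the paper's specified criteria:

```python
from dataclasses import dataclass

@dataclass
class EcoPrediction:
    label: str           # "What": referred object category
    box: tuple           # "Where": (x1, y1, x2, y2) in pixels
    stroke: tuple        # "When": (start_ms, end_ms) of the pointing stroke

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0

def is_strict_hit(pred, gold, box_thr=0.5, t_thr=0.5):
    """A clip counts only if all three dimensions match at once."""
    return (pred.label == gold.label
            and box_iou(pred.box, gold.box) >= box_thr
            and temporal_iou(pred.stroke, gold.stroke) >= t_thr)

def strict_eco_accuracy(preds, golds, **thr):
    hits = sum(is_strict_hit(p, g, **thr) for p, g in zip(preds, golds))
    return hits / len(golds)
```

The joint "and" in `is_strict_hit` is what makes the metric executable in the paper's sense: a model that names the right object but mislocates the stroke in time scores zero for that clip.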