🤖 AI Summary
Current embodied multimodal large language models (MLLMs) often exploit linguistic shortcuts when processing referential language, and existing benchmarks neglect the fine-grained spatiotemporal alignment between speech and pointing gestures, so they fail to genuinely assess coreference understanding. To address this, the authors propose the Egocentric Co-Speech Grounding (EcoG) task, which requires models to jointly predict the "What," "Where," and "When" of a referred object. They introduce EcoG-Bench, the first egocentric benchmark for this task, comprising 811 bilingual (English–Chinese) video clips annotated with millisecond-level gesture-stroke labels, dense spatial bounding boxes, precise timestamps, and word-level ASR transcripts. Humans achieve 96.9% strict accuracy on this benchmark, while the best current model reaches only 17.0%; supplying timestamp-aligned frames and accurate ASR improves performance to 42.9%, highlighting the critical role of interface design in making temporal alignment cues observable to models.
📝 Abstract
In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me *that*"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing *stroke*. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the *audio–visual alignment* required by deictic interaction. To bridge this gap, we introduce **Egocentric Co-Speech Grounding (EcoG)**, where grounding is executable only if an agent jointly predicts *What*, *Where*, and *When*. To operationalize this, we present **EcoG-Bench**, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of **811** egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a **Progressive Cognitive Evaluation** protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (**96.9%** strict Eco-Accuracy), the best native video–audio setting remains low (Gemini-3-Pro: **17.0%**). Moreover, in a diagnostic ablation, replacing the native video–audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (**17.0%** → **42.9%**). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech–gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
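The strict Eco-Accuracy above counts a clip as correct only when the *What*, *Where*, and *When* predictions are all simultaneously right, which is why scores collapse relative to per-dimension metrics. Here is a minimal sketch of such a joint metric; the 0.5 box-IoU and 0.5 temporal-IoU thresholds, the field names, and the exact matching rules are illustrative assumptions, not the paper's specified criteria:

```python
from dataclasses import dataclass

@dataclass
class EcoPrediction:
    label: str           # "What": referred object category
    box: tuple           # "Where": (x1, y1, x2, y2) in pixels
    stroke: tuple        # "When": (start_ms, end_ms) of the pointing stroke

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0

def is_strict_hit(pred, gold, box_thr=0.5, t_thr=0.5):
    """A clip counts only if all three dimensions match at once."""
    return (pred.label == gold.label
            and box_iou(pred.box, gold.box) >= box_thr
            and temporal_iou(pred.stroke, gold.stroke) >= t_thr)

def strict_eco_accuracy(preds, golds, **thr):
    hits = sum(is_strict_hit(p, g, **thr) for p, g in zip(preds, golds))
    return hits / len(golds)
```

The joint "and" in `is_strict_hit` is what makes the metric executable in the paper's sense: a model that names the right object but mislocates the stroke in time scores zero for that clip.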