Understanding Co-speech Gestures in-the-wild

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses gesture understanding in natural speech scenarios, introducing a tri-modal (gesture-text-speech) joint understanding framework with the first systematic task definitions for this setting. It establishes three benchmark tasks: gesture-based retrieval, gestured word spotting, and gesture-based active speaker detection. Methodologically, the authors propose an end-to-end, weakly supervised tri-modal representation learning framework that combines a global phrase-level contrastive loss with a local gesture-word coupling loss, so that speech and text provide complementary supervision for gesture modeling. Built on a shared tri-modal embedding with cross-modal alignment and multi-task co-training, the model achieves significant improvements over state-of-the-art methods, including large vision-language models, across all three tasks. To foster community progress, the authors publicly release a curated dataset, trained models, and source code, supporting research in embodied interaction and multimodal understanding.
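
The two objectives named above admit a compact contrastive formulation. Below is a minimal, hypothetical PyTorch sketch of a global phrase-level contrastive term and a local gesture-word coupling term; every name, shape, and hyperparameter (info_nce, global_phrase_loss, local_coupling_loss, the 0.07 temperature, the max-pooled coupling) is an illustrative assumption and not taken from the paper's released code.

```python
# Minimal, hypothetical sketch of the two training objectives described in the summary.
# All names, shapes, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired, L2-normalised embeddings (B, D)."""
    logits = (a @ b.t()) / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def global_phrase_loss(gesture: torch.Tensor, text: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
    """Global term: pull each pooled gesture-clip embedding towards its phrase-level
    text and speech embeddings, pushing away the other phrases in the batch."""
    g, t, s = (F.normalize(x, dim=-1) for x in (gesture, text, speech))
    return info_nce(g, t) + info_nce(g, s)


def local_coupling_loss(gesture_frames: torch.Tensor, word_tokens: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Local term (weakly supervised): each word embedding (B, W, D) is coupled to its
    best-matching gesture-frame embedding (B, T, D) via max-pooling over time, so
    word-level alignment can emerge without frame-level labels."""
    g = F.normalize(gesture_frames, dim=-1)
    w = F.normalize(word_tokens, dim=-1)
    sim = torch.einsum('btd,bwd->bwt', g, w)             # per-word similarity to every frame
    best = sim.amax(dim=-1)                               # (B, W) best frame per word
    return -F.logsigmoid(best / temperature).mean()       # reward strong best-frame matches
```

In training, the two terms would typically be combined with a weighting, e.g. loss = global_phrase_loss(g, t, s) + lam * local_coupling_loss(frames, words); the weighting and the exact form of the local coupling used in the paper may differ from this sketch.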

📝 Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
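
As a concrete illustration of task (i), the sketch below shows how gesture-based retrieval could be scored once the shared tri-modal embedding space is trained: rank candidate phrase embeddings by cosine similarity to each query gesture embedding and report recall@k. The function name, shapes, and the random-embedding example are assumptions for illustration and are not drawn from the released code.

```python
# Hypothetical evaluation loop for gesture-based retrieval in a shared embedding space.
# Encoders are assumed to already exist; shapes and names are illustrative only.
import torch
import torch.nn.functional as F


def retrieval_recall_at_k(gesture_emb: torch.Tensor, phrase_emb: torch.Tensor, k: int = 5) -> float:
    """gesture_emb, phrase_emb: (N, D) paired embeddings, where row i of each tensor
    comes from the same clip. Returns the fraction of gestures whose true phrase is
    ranked within the top-k candidates by cosine similarity."""
    g = F.normalize(gesture_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    sims = g @ p.t()                                  # (N, N) gesture-to-phrase similarities
    topk = sims.topk(k, dim=-1).indices               # indices of the k closest phrases
    targets = torch.arange(g.size(0)).unsqueeze(-1)   # ground-truth phrase index per gesture
    return (topk == targets).any(dim=-1).float().mean().item()


# Example with random embeddings standing in for trained encoders:
gestures = torch.randn(100, 256)
phrases = torch.randn(100, 256)
print(f"R@5 (random baseline): {retrieval_recall_at_k(gestures, phrases, k=5):.3f}")
```

Text-to-gesture retrieval is the same computation with the roles of the gesture and phrase embeddings swapped.
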
Problem

Research questions and friction points this paper is trying to address.

Understanding co-speech gestures in natural settings
Evaluating gesture-text-speech association comprehension
Learning tri-modal speech-text-video-gesture representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tri-modal speech-text-video-gesture representation learning
Global phrase contrastive and local gesture-word coupling losses
Weakly supervised learning from in-the-wild videos