🤖 AI Summary
Existing speech-driven gesture generation methods suffer from three key limitations: inaccurate audio-gesture temporal alignment, weak semantic contextual modeling, and insufficient pixel-level photorealism. To address these, we propose a three-stage collaborative framework: (1) an explicit temporal alignment mechanism that precisely locates rhythmic and semantic trigger points in speech; (2) a knowledge distillation–based contextualized gesture tokenization method for semantics-aware motion representation learning; and (3) a structure-aware refinement module integrating skeletal topology constraints with edge-guided generative video synthesis. Our approach ensures long-sequence temporal coherence while enabling fine-grained, video-level gesture editing. Quantitatively, it achieves state-of-the-art performance across synchronization accuracy (SyncScore), perceptual realism (FID/LPIPS), and semantic fidelity.
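To make the three-stage structure concrete, here is a minimal sketch of how the pipeline composes; all module names (`AudioGestureAligner`, `ContextualTokenizer`, `StructureAwareRefiner`) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the three-stage pipeline described above.
# The submodule names are assumptions, not the authors' code.
import torch.nn as nn

class ContextualGesturePipeline(nn.Module):
    def __init__(self, aligner: nn.Module, tokenizer: nn.Module, refiner: nn.Module):
        super().__init__()
        self.aligner = aligner      # stage 1: locates rhythmic/semantic trigger points
        self.tokenizer = tokenizer  # stage 2: distilled, speech-aware gesture tokens
        self.refiner = refiner      # stage 3: skeleton/edge-guided video synthesis

    def forward(self, audio, reference_frame):
        # Stage 1: explicit temporal alignment between speech and gesture timing.
        triggers = self.aligner(audio)
        # Stage 2: decode contextualized gesture tokens into keypoint motion.
        keypoints = self.tokenizer(audio, triggers)
        # Stage 3: connect keypoints via skeletal edges and render video frames.
        return self.refiner(keypoints, reference_frame)
```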
📝 Abstract
Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interaction by synchronizing gestures with speech. Despite recent advancements, existing methods struggle to accurately identify the rhythmic or semantic triggers in audio, to generate contextualized gesture patterns, and to achieve pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections between gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic, speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1. Project Page: https://andypinxinliu.github.io/Contextual-Gesture/