Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

📅 2025-12-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses poor audio–gesture synchronization and unnatural motion in speech-driven co-speech gesture video generation. Methodologically, it introduces a novel framework integrating diffusion models with motion graph retrieval: (1) A diffusion model implicitly captures the many-to-many mapping between audio and gestures, conditioned on both low-level (spectral) and high-level (semantic) audio features; (2) A motion graph is constructed, and a context-aware retrieval algorithm is designed to select motion paths that jointly optimize global trajectory consistency and local motion similarity; (3) Initial gesture sequences are generated via diffusion, then refined through graph-based retrieval and seamless segment concatenation to produce coherent videos. Experiments demonstrate significant improvements over state-of-the-art methods in both audio–gesture temporal alignment accuracy and motion naturalness.
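The summary's retrieval step, selecting a graph path that jointly optimizes global trajectory consistency and local motion similarity, can be made concrete with a small sketch. The paper's actual algorithm is not published here; the code below is a minimal beam-search interpretation, assuming the motion graph is a node-to-successors dictionary, each node stores a pose-feature vector, and the diffusion output is a per-frame feature matrix. All names (`retrieve_path`, `alpha`, `beam`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_similarity(node_feat: np.ndarray, target_feat: np.ndarray) -> float:
    """Cosine similarity between one graph node's pose features and
    the matching frame of the diffusion-generated query motion."""
    denom = np.linalg.norm(node_feat) * np.linalg.norm(target_feat) + 1e-8
    return float(node_feat @ target_feat / denom)

def global_consistency(path_feats: list, query_feats: np.ndarray) -> float:
    """Mean similarity of the whole path so far against the query
    trajectory -- a stand-in for the paper's global-trajectory term."""
    sims = [local_similarity(p, q) for p, q in zip(path_feats, query_feats)]
    return float(np.mean(sims))

def retrieve_path(graph: dict, node_feats: dict, query: np.ndarray,
                  start: int, alpha: float = 0.5, beam: int = 8) -> list:
    """Beam search for a path whose motion matches the query both
    locally (per step) and globally (whole trajectory).
    graph:      node -> list of successor nodes (motion-graph edges);
                assumes every node has at least one outgoing edge
    node_feats: node -> pose-feature vector
    query:      (T, D) features of the diffusion-generated motion
    """
    beams = [([start], [node_feats[start]])]          # (path, path features)
    for t in range(1, len(query)):
        candidates = []
        for path, feats in beams:
            for nxt in graph[path[-1]]:
                new_feats = feats + [node_feats[nxt]]
                score = (alpha * global_consistency(new_feats, query[: t + 1])
                         + (1 - alpha) * local_similarity(node_feats[nxt], query[t]))
                candidates.append((score, path + [nxt], new_feats))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(p, f) for _, p, f in candidates[:beam]]  # keep top-k beams
    return beams[0][0]
```

The `alpha` weight trades off trajectory-level agreement against frame-level agreement; the paper describes optimizing both jointly, and a weighted sum is one simple way to realize that.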

📝 Abstract
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
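The abstract notes that nodes on the retrieved path are not always sequentially continuous, so the final step stitches segments into one coherent motion. How the paper blends seams is not specified here; below is a minimal sketch, assuming each segment is a `(frames, joints, 3)` pose array longer than the overlap window, using a linear crossfade, which is one common way to hide such discontinuities.

```python
import numpy as np

def crossfade_stitch(segments: list[np.ndarray], overlap: int = 5) -> np.ndarray:
    """Concatenate motion segments, linearly blending `overlap` frames
    at each seam so the pose stays continuous across segment borders.
    The window size is an assumed parameter, not a value from the paper."""
    out = segments[0]
    for seg in segments[1:]:
        w = np.linspace(0.0, 1.0, overlap)[:, None, None]   # blend weights
        blended = (1.0 - w) * out[-overlap:] + w * seg[:overlap]
        out = np.concatenate([out[:-overlap], blended, seg[overlap:]], axis=0)
    return out
```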
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized and natural co-speech gesture videos from audio input.
Addressing many-to-many audio-gesture mapping limitations in motion graph retrieval.
Enhancing gesture motion generation with diffusion models and multi-level audio features (see the feature-extraction sketch after this list).
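The summary says the diffusion model is conditioned on low-level (spectral) and high-level (semantic) audio features but does not name them. The sketch below assumes log-mel spectrogram frames for the low level and wav2vec 2.0 hidden states for the semantic level; both are common choices in speech-driven gesture work, not confirmed details of this paper.

```python
import librosa
import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

def extract_audio_features(wav_path: str):
    """Return (low_level, high_level) audio features for conditioning.
    Mel spectrogram / wav2vec 2.0 are assumed stand-ins for the paper's
    unspecified spectral and semantic features."""
    audio, sr = librosa.load(wav_path, sr=16_000)

    # Low-level spectral features: log-mel spectrogram, shape (frames, 80).
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
    low_level = librosa.power_to_db(mel).T

    # High-level semantic features: pretrained speech-encoder states.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        high_level = model(**inputs).last_hidden_state[0].numpy()  # (frames, 768)

    return low_level, high_level
```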
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model learns the audio–gesture joint distribution (see the sampling sketch after this list)
Extracts multi-level audio features for enhanced training
Motion-based graph retrieval with global-local similarity assessment
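As a minimal sketch of the generation stage, the following DDPM-style reverse loop samples a gesture sequence conditioned on audio features. The noise-predictor interface `eps_model(x_t, t, cond)`, the linear noise schedule, and the tensor shapes are all assumptions for illustration; the paper's architecture is not detailed in this summary.

```python
import torch

@torch.no_grad()
def sample_gestures(eps_model, audio_feats: torch.Tensor,
                    frames: int, pose_dim: int, steps: int = 50) -> torch.Tensor:
    """DDPM-style ancestral sampling of a gesture sequence conditioned
    on audio features. `eps_model` is an assumed conditional noise
    predictor; the schedule and shapes are illustrative, not the paper's."""
    betas = torch.linspace(1e-4, 2e-2, steps)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, frames, pose_dim)               # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.tensor([t]), audio_feats)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise        # one reverse step
    return x                                           # (1, frames, pose_dim)
```

Because a diffusion model samples from the learned joint distribution rather than regressing a single output, it can return different plausible gestures for the same audio, which is how the framework sidesteps the one-to-one mapping limitation described in the abstract.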
Yafei Song
Alibaba Group
Computer Vision · Machine Learning · Augmented Reality · Robotics
Peng Zhang
Tongyi Lab, Alibaba Group
Bang Zhang
Tongyi Lab, Alibaba Group