Democratizing High-Fidelity Co-Speech Gesture Video Generation

📅 2025-07-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses three key challenges in high-fidelity audio-driven gesture video generation: the one-to-many audio-visual mapping, the scarcity of high-quality annotated data, and high computational cost. Methodologically, it introduces a lightweight 2D full-body skeletal representation as an intermediate modality and designs a diffusion model to achieve fine-grained temporal alignment and fusion of audio and skeletal features; this is coupled with an off-the-shelf human video generation model for high-fidelity video synthesis. Contributions include: (1) releasing CSG-405, the first large-scale public dataset, comprising 405 hours of high-definition videos with precise 2D skeletal annotations; (2) enabling natural co-articulation of facial expressions and body gestures under low-resource settings; and (3) achieving state-of-the-art performance in visual quality, audio-visual synchronization, and cross-speaker/scene generalization.
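
To make the two-stage design concrete, the sketch below shows what the first stage could look like in PyTorch. It is a minimal illustration under stated assumptions: the class name, the 133 whole-body joints, the feature sizes, and the Transformer denoiser are all placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToSkeletonDiffusion(nn.Module):
    """Stage 1 (hypothetical sketch): denoise a 2D skeleton sequence
    conditioned on per-frame audio features and the skeleton extracted
    from the speaker's reference image."""

    def __init__(self, n_joints=133, d_audio=128, d_model=256, n_steps=1000):
        super().__init__()
        self.skel_proj = nn.Linear(n_joints * 2, d_model)   # (x, y) per joint
        self.audio_proj = nn.Linear(d_audio, d_model)       # fine-grained audio
        self.ref_proj = nn.Linear(n_joints * 2, d_model)    # body-shape condition
        self.t_embed = nn.Embedding(n_steps, d_model)       # diffusion timestep
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_joints * 2)

    def forward(self, noisy_skel, audio_feats, ref_skel, t):
        # noisy_skel: (B, T, J*2), audio_feats: (B, T, d_audio),
        # ref_skel: (B, J*2), t: (B,) integer timesteps.
        h = self.skel_proj(noisy_skel) + self.audio_proj(audio_feats)
        h = h + self.ref_proj(ref_skel).unsqueeze(1) + self.t_embed(t).unsqueeze(1)
        return self.head(self.backbone(h))   # predicted noise on the skeletons
```

In the second stage, the denoised skeleton sequence and the reference image would be handed to the off-the-shelf human video generation model, which is treated here as a black box.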

πŸ“ Abstract
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task is challenging due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body-shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model, together with the speaker's reference image, to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and spanning diverse speaker demographics. Experiments show that our method surpasses state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
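
The skeleton-audio feature fusion above is trained as a conditional diffusion model. The following is a minimal sketch of one DDPM-style training step (epsilon-prediction with an MSE loss); the noise schedule and loss are standard diffusion practice assumed here for illustration, not details confirmed by the paper, and `model` can be any denoiser with the signature used in the stage-1 sketch above.

```python
import torch
import torch.nn.functional as F

def train_step(model, skel, audio_feats, ref_skel, alphas_cumprod, optimizer):
    """One hypothetical training step. skel: (B, T, J*2) ground-truth 2D
    skeleton sequence, temporally aligned with audio_feats: (B, T, d_audio)."""
    B = skel.shape[0]
    # Sample a random diffusion timestep per clip.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=skel.device)
    abar = alphas_cumprod[t].view(B, 1, 1)                   # cumulative alpha_t
    noise = torch.randn_like(skel)
    noisy = abar.sqrt() * skel + (1 - abar).sqrt() * noise   # forward noising
    # The denoiser fuses audio and skeleton features internally.
    pred = model(noisy, audio_feats, ref_skel, t)
    loss = F.mse_loss(pred, noise)                           # epsilon-prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
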
Problem

Research questions and friction points this paper is trying to address.

Synthesize realistic audio-aligned gesture videos
Overcome one-to-many audio-visual mapping challenges
Address dataset scarcity and high computational costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight 2D skeleton framework
Diffusion model with audio-skeleton fusion
Public CSG-405 dataset creation (a hypothetical record layout is sketched below)
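
The summary does not say how CSG-405 clips are packaged on disk. Purely as a hypothetical illustration of what the 2D skeletal annotations imply, a per-clip record might carry fields like these; every field name below is invented, not the released schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GestureClip:
    """Hypothetical CSG-405-style record; field names are placeholders."""
    video_path: str         # high-definition speaker video
    audio_path: str         # aligned speech track
    speech_type: str        # one of the dataset's 71 speech types
    keypoints: np.ndarray   # (T, J, 2) 2D full-body skeleton per frame
```
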
🔎 Similar Papers
No similar papers found.
Xu Yang
South China University of Technology

Shaoli Huang
Tencent AI Lab
Deep Learning, Computer Vision

Shenbo Xie
South China University of Technology

Xuelin Chen
Tencent AI Lab

Yifei Liu
South China University of Technology

Changxing Ding
Professor, South China University of Technology
Computer Vision, Embodied AI