AI Summary
This paper addresses three key challenges in high-fidelity audio-driven gesture video generation: the one-to-many mapping between audio and visual content, scarcity of high-quality annotated data, and high computational cost. Methodologically, it introduces a lightweight 2D full-body skeletal representation as an intermediate modality and designs a diffusion model to achieve fine-grained temporal alignment and fusion of audio and skeletal features; this is coupled with an off-the-shelf human video generation model for high-fidelity video synthesis. Contributions include: (1) releasing CSG-405, the first large-scale public dataset comprising 405 hours of high-definition videos with precise 2D skeletal annotations; (2) enabling natural co-articulation of facial expressions and body gestures under low-resource settings; and (3) achieving state-of-the-art performance across visual quality, audio-visual synchronization, and cross-speaker/scene generalization.
Abstract
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task is challenging due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that uses 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body-shape consistency. The generated skeletons are then fed, together with the speaker's reference image, into an off-the-shelf human video generation model to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and covering diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
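To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the first stage as the abstract describes it: a denoising network that fuses per-frame audio features with a reference skeleton to predict skeletal motion, whose output would then be handed to a pose-conditioned video generator. The class names (`SkeletonAudioFusion`, `SkeletonDenoiser`), the cross-attention fusion, the whole-body keypoint count, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SkeletonAudioFusion(nn.Module):
    """Fuses per-frame skeleton tokens with fine-grained audio features via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skel_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # skel_feats:  (B, T, dim) noisy skeleton tokens, one per frame
        # audio_feats: (B, T, dim) audio segment features aligned to the same frames
        fused, _ = self.attn(query=skel_feats, key=audio_feats, value=audio_feats)
        return self.norm(skel_feats + fused)

class SkeletonDenoiser(nn.Module):
    """DDPM-style noise predictor for a 2D skeleton motion sequence (hypothetical)."""
    def __init__(self, num_joints: int = 133, dim: int = 256):
        super().__init__()
        # 133 whole-body keypoints is an assumed count, not taken from the paper.
        self.in_proj = nn.Linear(num_joints * 2, dim)    # (x, y) per joint per frame
        self.ref_proj = nn.Linear(num_joints * 2, dim)   # skeleton from the reference image
        self.t_embed = nn.Embedding(1000, dim)           # diffusion timestep embedding
        self.fusion = SkeletonAudioFusion(dim)
        self.out_proj = nn.Linear(dim, num_joints * 2)

    def forward(self, noisy_motion, audio_feats, ref_skeleton, t):
        # noisy_motion: (B, T, J*2), ref_skeleton: (B, J*2), t: (B,) integer timesteps
        h = self.in_proj(noisy_motion) + self.ref_proj(ref_skeleton).unsqueeze(1)
        h = h + self.t_embed(t).unsqueeze(1)
        h = self.fusion(h, audio_feats)
        return self.out_proj(h)   # predicted noise, same shape as noisy_motion

# Single denoising step (the full sampling loop and audio encoder are omitted).
B, T, J, D = 2, 64, 133, 256
denoiser = SkeletonDenoiser(num_joints=J, dim=D)
audio_feats = torch.randn(B, T, D)        # e.g. from a pretrained speech encoder
ref_skeleton = torch.randn(B, J * 2)      # extracted from the speaker's reference image
noisy_motion = torch.randn(B, T, J * 2)
eps_hat = denoiser(noisy_motion, audio_feats, ref_skeleton,
                   torch.zeros(B, dtype=torch.long))

# Stage 2 (not shown): the sampled skeleton sequence plus the reference image are
# passed to an off-the-shelf pose-conditioned human video generation model.
```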