AI Summary
This paper addresses three key challenges in high-fidelity audio-driven gesture video generation: the one-to-many mapping between audio and visual content, scarcity of high-quality annotated data, and high computational cost. Methodologically, it introduces a lightweight 2D full-body skeletal representation as an intermediate modality and designs a diffusion model to achieve fine-grained temporal alignment and fusion of audio and skeletal features; this is coupled with an off-the-shelf human video generation model for high-fidelity video synthesis. Contributions include: (1) releasing CSG-405, the first large-scale public dataset comprising 405 hours of high-definition videos with precise 2D skeletal annotations; (2) enabling natural co-articulation of facial expressions and body gestures under low-resource settings; and (3) achieving state-of-the-art performance across visual quality, audio-visual synchronization, and cross-speaker/scene generalization.
Abstract
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task is challenging due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that uses 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body-shape consistency. The generated skeletons are then fed, together with the speaker's reference image, into an off-the-shelf human video generation model to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and covering diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
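To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the first stage as the abstract describes it: a denoising network that fuses per-frame audio features with a reference skeleton to predict skeletal motion, whose output would then be handed to a pose-conditioned video generator. The class names (`SkeletonAudioFusion`, `SkeletonDenoiser`), the cross-attention fusion, the whole-body keypoint count, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SkeletonAudioFusion(nn.Module):
    """Fuses per-frame skeleton tokens with fine-grained audio features via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skel_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # skel_feats:  (B, T, dim) noisy skeleton tokens, one per frame
        # audio_feats: (B, T, dim) audio segment features aligned to the same frames
        fused, _ = self.attn(query=skel_feats, key=audio_feats, value=audio_feats)
        return self.norm(skel_feats + fused)

class SkeletonDenoiser(nn.Module):
    """DDPM-style noise predictor for a 2D skeleton motion sequence (hypothetical)."""
    def __init__(self, num_joints: int = 133, dim: int = 256):
        super().__init__()
        # 133 whole-body keypoints is an assumed count, not taken from the paper.
        self.in_proj = nn.Linear(num_joints * 2, dim)    # (x, y) per joint per frame
        self.ref_proj = nn.Linear(num_joints * 2, dim)   # skeleton from the reference image
        self.t_embed = nn.Embedding(1000, dim)           # diffusion timestep embedding
        self.fusion = SkeletonAudioFusion(dim)
        self.out_proj = nn.Linear(dim, num_joints * 2)

    def forward(self, noisy_motion, audio_feats, ref_skeleton, t):
        # noisy_motion: (B, T, J*2), ref_skeleton: (B, J*2), t: (B,) integer timesteps
        h = self.in_proj(noisy_motion) + self.ref_proj(ref_skeleton).unsqueeze(1)
        h = h + self.t_embed(t).unsqueeze(1)
        h = self.fusion(h, audio_feats)
        return self.out_proj(h)   # predicted noise, same shape as noisy_motion

# Single denoising step (the full sampling loop and audio encoder are omitted).
B, T, J, D = 2, 64, 133, 256
denoiser = SkeletonDenoiser(num_joints=J, dim=D)
audio_feats = torch.randn(B, T, D)        # e.g. from a pretrained speech encoder
ref_skeleton = torch.randn(B, J * 2)      # extracted from the speaker's reference image
noisy_motion = torch.randn(B, T, J * 2)
eps_hat = denoiser(noisy_motion, audio_feats, ref_skeleton,
                   torch.zeros(B, dtype=torch.long))

# Stage 2 (not shown): the sampled skeleton sequence plus the reference image are
# passed to an off-the-shelf pose-conditioned human video generation model.
```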