CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current speech-driven 3D gesture generation methods suffer from rigid outputs and temporal misalignment due to the scarcity of high-quality, paired 3D speech-gesture data. To address this, we propose a novel pretraining-finetuning paradigm. First, we introduce GES-X, the first large-scale, open-source 3D co-speech gesture dataset, comprising 40 million pose frames. Second, we pretrain a 1-billion-parameter unconditional diffusion model on it to serve as our gesture experts. Finally, we finetune an audio-conditioned ControlNet module for precise audio-to-motion temporal alignment, together with a Mixture-of-Gesture-Experts (MoGE) routing mechanism that fuses speech and gesture features, enabling zero-shot speech-to-3D-gesture generation. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods across multiple quantitative metrics. This work establishes a new paradigm and provides critical infrastructure for natural, robust speech-driven gesture synthesis in real-world applications.
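The ControlNet design mentioned here follows the familiar pattern of a frozen pretrained backbone plus a trainable copy whose outputs are injected back through zero-initialized projections. Below is a minimal PyTorch sketch of that pattern, assuming a transformer denoiser backbone; all module names, dimensions, and the audio-feature interface are illustrative assumptions, not the authors' released code.

```python
import copy
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    """Stand-in for the pretrained unconditional gesture diffusion backbone."""
    def __init__(self, dim=512, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)]
        )

    def forward(self, x, residuals=None):
        # x: noisy gesture tokens (B, T, dim); residuals: per-block control signals
        for i, block in enumerate(self.blocks):
            x = block(x)
            if residuals is not None:
                x = x + residuals[i]
        return x

class AudioControlNet(nn.Module):
    """Trainable copy of the frozen backbone; injects audio via zero-initialized projections."""
    def __init__(self, pretrained: GestureDenoiser, audio_dim=768, dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.copy_blocks = copy.deepcopy(pretrained.blocks)   # trainable copy of the backbone
        self.zero_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in pretrained.blocks])
        for z in self.zero_projs:                             # zero-init so training starts exactly
            nn.init.zeros_(z.weight)                          # from the frozen backbone's behavior
            nn.init.zeros_(z.bias)

    def forward(self, x, audio):
        h = x + self.audio_proj(audio)                        # add the audio condition to gesture tokens
        residuals = []
        for block, zero in zip(self.copy_blocks, self.zero_projs):
            h = block(h)
            residuals.append(zero(h))
        return residuals

# Hypothetical usage: freeze the gesture experts, train only the ControlNet branch.
backbone = GestureDenoiser()
for p in backbone.parameters():
    p.requires_grad_(False)
control = AudioControlNet(backbone)
x_noisy = torch.randn(2, 120, 512)                            # (batch, frames, feature dim)
audio = torch.randn(2, 120, 768)                              # e.g. frame-level speech features
denoised = backbone(x_noisy, residuals=control(x_noisy, audio))
```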

📝 Abstract
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, existing methods often produce stiff and implausible gestures for unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is a custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to build a large, generalizable gesture diffusion model by learning the abundant posture manifold. To alleviate the scarcity of 3D data, we first construct GES-X, a large-scale co-speech 3D gesture dataset containing more than 40M meshed posture instances across 4.3K speakers. We then scale the unconditional diffusion model up to 1B parameters and pre-train it to serve as our gesture experts. At the finetuning stage, we present an audio ControlNet that incorporates the human voice as a condition prompt to guide gesture generation; the audio ControlNet is built as a trainable copy of our pre-trained diffusion model. Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block that adaptively fuses the audio embedding of the human speech with the gesture features from the pre-trained gesture experts via a routing mechanism. This keeps the audio embedding temporally coordinated with the motion features while preserving vivid and diverse gesture generation. Extensive experiments demonstrate that CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation. The dataset will be publicly available at: https://mattie-e.github.io/GES-X/
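The MoGE block described in the abstract can be read as a soft mixture-of-experts over gesture features, gated by the audio embedding: a router, conditioned on both streams, weights several expert branches per frame. The following is a minimal PyTorch sketch under that reading, assuming the audio embedding is already projected to the gesture feature width; the number of experts, layer sizes, and residual fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoGEBlock(nn.Module):
    """Routes each frame's fused audio+gesture feature to a weighted mix of gesture experts."""
    def __init__(self, dim=512, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim * 2, num_experts)          # gating network over both streams
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
             for _ in range(num_experts)]
        )

    def forward(self, gesture_feat, audio_emb):
        # gesture_feat, audio_emb: (B, T, dim), assumed temporally aligned frame by frame
        gate = F.softmax(self.router(torch.cat([gesture_feat, audio_emb], dim=-1)), dim=-1)  # (B, T, E)
        expert_out = torch.stack([e(gesture_feat) for e in self.experts], dim=-1)            # (B, T, dim, E)
        fused = (expert_out * gate.unsqueeze(-2)).sum(dim=-1)  # per-frame weighted sum over experts
        return gesture_feat + fused                            # residual fusion (an assumption)

# Hypothetical usage with the shapes assumed above.
moge = MoGEBlock()
gesture_feat = torch.randn(2, 120, 512)
audio_emb = torch.randn(2, 120, 512)
fused = moge(gesture_feat, audio_emb)                          # (2, 120, 512)
```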
Problem

Research questions and friction points this paper is trying to address.

Generate coherent 3D gestures from unseen speech inputs
Overcome the scarcity of paired 3D speech-gesture data
Ensure vivid and diverse gesture synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrain-finetune paradigm for gesture synthesis
Large-scale 3D gesture dataset GES-X
Audio ControlNet with Mixture-of-Gesture-Experts block