3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unnaturalness in speech-driven gesture generation caused by semantic inconsistency and spatial instability. It introduces, for the first time, diffusion policies from robotics into this domain, proposing a holistic motion control framework that achieves global coordination by modeling inter-frame dynamics of continuous trajectories. Furthermore, the authors design a Gesture-Audio-Phoneme (GAP) multimodal fusion module to enable fine-grained alignment among speech, body gestures, and facial expressions. Experimental results on the BEAT2 dataset demonstrate that the proposed method significantly outperforms existing approaches, generating gestures that are more natural, expressive, and tightly synchronized with input speech.

📝 Abstract
Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent body-motion coordination and spatially unstable, meaningless movements under existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem via diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and produces spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that deeply integrates and refines multimodal signals, ensuring structured, fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
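The core reformulation in the abstract, treating frame-to-frame pose variations as "actions" and sampling them with a diffusion-policy-style denoiser, can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`poses_to_actions`, `sample_actions`), the simplistic denoising schedule, and the stand-in conditioning vector are all hypothetical, chosen only to show the action parameterization and its integration back into a trajectory.

```python
import numpy as np

def poses_to_actions(poses):
    """Convert an absolute pose trajectory of shape (T, D) into
    frame-to-frame deltas ('actions'), the control representation
    that diffusion policies operate on."""
    return np.diff(poses, axis=0)

def actions_to_poses(initial_pose, actions):
    """Integrate an action sequence back into an absolute trajectory
    (inverse of poses_to_actions)."""
    return np.vstack([initial_pose,
                      initial_pose + np.cumsum(actions, axis=0)])

def sample_actions(denoiser, cond, horizon, dim, steps=50, seed=None):
    """Toy reverse-diffusion loop over an action chunk, conditioned on
    speech features `cond` (a stand-in for the fused audio/phoneme
    signal). `denoiser` predicts the noise to remove at each step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dim))       # start from pure noise
    for t in reversed(range(1, steps + 1)):
        alpha = t / steps                         # crude noise level
        eps_hat = denoiser(x, cond, alpha)        # predicted noise
        x = x - alpha * eps_hat                   # denoising step
        if t > 1:                                 # small re-noising,
            x = x + 0.05 * rng.standard_normal(x.shape)  # DDPM-style
    return x
```

Because actions are deltas rather than absolute poses, consecutive frames are coupled by construction, which is one plausible reading of how the method enforces spatially coherent trajectories.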
Problem

Research questions and friction points this paper is trying to address.

co-speech gesture generation
holistic motion
semantic coherence
spatial stability
multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy
holistic co-speech gesture
action-based control
phoneme-aware alignment
multi-modal fusion
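The GAP module's fine-grained alignment of gesture, audio, and phoneme streams is described only at a high level, but cross-modal attention is a common way to realize such fusion. The sketch below is an assumption, not the paper's architecture: `gap_fuse`, the single-head attention, and the residual combination are all hypothetical illustrations of gesture tokens attending to the two speech-side modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention: `query` tokens gather
    information from another modality's `keys`/`values`."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (Tq, Tk) similarity
    return softmax(scores, axis=-1) @ values

def gap_fuse(gesture, audio, phoneme):
    """Toy gesture-audio-phoneme fusion: gesture tokens attend to the
    audio and phoneme streams separately, and the results are merged
    with a residual sum."""
    g2a = cross_attend(gesture, audio, audio)      # gesture -> audio
    g2p = cross_attend(gesture, phoneme, phoneme)  # gesture -> phoneme
    return gesture + g2a + g2p
```

Attending to phonemes separately from raw audio is one way the phoneme-aware alignment named in the title could be realized, since phoneme tokens carry discrete timing cues that continuous audio features blur.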