🤖 AI Summary
Existing speech-driven 3D human motion generation methods suffer from two key limitations: (1) coarse-grained style encoding, which fails to capture region-specific stylistic variations (e.g., upper vs. lower body); and (2) neglect of dynamic prosodic and affective cues in speech, leading to temporally inconsistent and expressively impoverished motion. This paper proposes MimicParts, a region-aware stylized motion generation framework. Its core contributions are: (i) a region-adaptive style encoder that explicitly models distinct stylistic characteristics for anatomically defined body parts; and (ii) a speech-driven, part-aware attention mechanism integrated with a denoising network, enabling fine-grained temporal alignment between speech features (prosody, emotion) and region-specific motion dynamics. Extensive experiments on multiple benchmark datasets demonstrate significant improvements in motion naturalness, style fidelity, and temporal coherence, outperforming state-of-the-art methods.
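To make the region-adaptive style encoding concrete, below is a minimal PyTorch sketch of one plausible realization: the body's joints are partitioned into regions, and each region gets its own sequence encoder producing a localized style code. The `upper`/`lower` joint split, the GRU encoders, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RegionStyleEncoder(nn.Module):
    """Encodes a separate style embedding per body region (sketch).

    `region_joints` maps a region name to the joint indices it covers;
    the upper/lower grouping below is an assumed partition for
    illustration only.
    """

    def __init__(self, joint_dim=6, style_dim=128, region_joints=None):
        super().__init__()
        if region_joints is None:
            region_joints = {
                "upper": list(range(0, 13)),   # assumed upper-body joints
                "lower": list(range(13, 22)),  # assumed lower-body joints
            }
        self.region_joints = region_joints
        # One recurrent encoder per region -> one style vector per region.
        self.encoders = nn.ModuleDict({
            name: nn.GRU(len(idx) * joint_dim, style_dim, batch_first=True)
            for name, idx in region_joints.items()
        })

    def forward(self, motion):
        # motion: (batch, frames, num_joints, joint_dim)
        styles = {}
        for name, idx in self.region_joints.items():
            region = motion[:, :, idx, :].flatten(2)  # (B, T, |idx|*joint_dim)
            _, h = self.encoders[name](region)        # h: (1, B, style_dim)
            styles[name] = h.squeeze(0)               # per-region style code
        return styles
```

Keeping the encoders independent per region is what lets the model assign, say, an energetic upper-body style and a calm lower-body style to the same clip.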
📝 Abstract
Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation through part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
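The part-aware attention block can be pictured as per-region cross-attention over the speech features, so each body part queries the rhythm and emotion cues independently. The sketch below is a hypothetical illustration under assumed shapes and head counts; the residual fusion and per-region attention modules are our assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class PartAwareAttention(nn.Module):
    """Per-region cross-attention to speech features (sketch).

    Each region's noisy motion tokens attend to the speech sequence
    (carrying prosody/emotion cues) through its own attention head set,
    so audio guidance is applied region by region.
    """

    def __init__(self, dim=256, num_regions=2, num_heads=4):
        super().__init__()
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_regions)
        ])

    def forward(self, region_tokens, speech_feats):
        # region_tokens: list of (B, T, dim) tensors, one per body region
        # speech_feats:  (B, T_audio, dim) prosody/emotion features
        out = []
        for tokens, attn in zip(region_tokens, self.attn):
            fused, _ = attn(query=tokens, key=speech_feats, value=speech_feats)
            out.append(tokens + fused)  # residual fusion per region
        return out
```

In a denoising network this block would sit inside each denoising step, letting speech cues modulate each region's tokens separately rather than through a single global conditioning vector.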