🤖 AI Summary
Existing speech-driven 3D human motion generation methods suffer from two key limitations: (1) coarse-grained style encoding, which fails to capture region-specific stylistic variations (e.g., upper vs. lower body); and (2) neglect of dynamic prosodic and affective cues in speech, leading to temporally inconsistent and expressively impoverished motion. This paper proposes MimicParts, a region-aware stylized motion generation framework. Its core contributions are: (i) a region-adaptive style encoder that explicitly models distinct stylistic characteristics for anatomically defined body parts; and (ii) a speech-driven, part-aware attention mechanism integrated with a denoising network, enabling fine-grained temporal alignment between speech features (prosody, emotion) and region-specific motion dynamics. Extensive experiments on multiple benchmark datasets demonstrate significant improvements in motion naturalness, style fidelity, and temporal coherence, outperforming state-of-the-art methods.
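To make the region-adaptive style encoding concrete, below is a minimal PyTorch sketch of one plausible realization: the body's joints are partitioned into regions, and each region gets its own sequence encoder producing a localized style code. The `upper`/`lower` joint split, the GRU encoders, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RegionStyleEncoder(nn.Module):
    """Encodes a separate style embedding per body region (sketch).

    `region_joints` maps a region name to the joint indices it covers;
    the upper/lower grouping below is an assumed partition for
    illustration only.
    """

    def __init__(self, joint_dim=6, style_dim=128, region_joints=None):
        super().__init__()
        if region_joints is None:
            region_joints = {
                "upper": list(range(0, 13)),   # assumed upper-body joints
                "lower": list(range(13, 22)),  # assumed lower-body joints
            }
        self.region_joints = region_joints
        # One recurrent encoder per region -> one style vector per region.
        self.encoders = nn.ModuleDict({
            name: nn.GRU(len(idx) * joint_dim, style_dim, batch_first=True)
            for name, idx in region_joints.items()
        })

    def forward(self, motion):
        # motion: (batch, frames, num_joints, joint_dim)
        styles = {}
        for name, idx in self.region_joints.items():
            region = motion[:, :, idx, :].flatten(2)  # (B, T, |idx|*joint_dim)
            _, h = self.encoders[name](region)        # h: (1, B, style_dim)
            styles[name] = h.squeeze(0)               # per-region style code
        return styles
```

Keeping the encoders independent per region is what lets the model assign, say, an energetic upper-body style and a calm lower-body style to the same clip.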
📝 Abstract
Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation through part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
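The part-aware attention block can be pictured as per-region cross-attention over the speech features, so each body part queries the rhythm and emotion cues independently. The sketch below is a hypothetical illustration under assumed shapes and head counts; the residual fusion and per-region attention modules are our assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class PartAwareAttention(nn.Module):
    """Per-region cross-attention to speech features (sketch).

    Each region's noisy motion tokens attend to the speech sequence
    (carrying prosody/emotion cues) through its own attention head set,
    so audio guidance is applied region by region.
    """

    def __init__(self, dim=256, num_regions=2, num_heads=4):
        super().__init__()
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_regions)
        ])

    def forward(self, region_tokens, speech_feats):
        # region_tokens: list of (B, T, dim) tensors, one per body region
        # speech_feats:  (B, T_audio, dim) prosody/emotion features
        out = []
        for tokens, attn in zip(region_tokens, self.attn):
            fused, _ = attn(query=tokens, key=speech_feats, value=speech_feats)
            out.append(tokens + fused)  # residual fusion per region
        return out
```

In a denoising network this block would sit inside each denoising step, letting speech cues modulate each region's tokens separately rather than through a single global conditioning vector.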