Semantics-aware human motion generation from audio instructions

📅 2025-05-29
🏛️ Graphical Models
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak semantic alignment and physically implausible motions in audio-driven human pose generation, this paper proposes a semantics-aware cross-modal alignment framework. It explicitly models fine-grained semantics extracted from audio instructions, including verbs, nouns, and spatial relations, and enforces kinematic constraints to ensure physical plausibility. The authors design an end-to-end model that integrates Whisper for speech encoding, a semantics-enhanced Transformer decoder, and a differentiable SMPL pose-regression module, jointly optimized with a contrastive loss and a kinematic loss. On the Audio-to-Pose benchmark, the method achieves an 18.7% improvement in pose accuracy, along with significant gains in BLEU-4 score and action-verb recall. Notably, it is reported as the first to generate semantically consistent and temporally coherent 3D motions for complex multi-step instructions (e.g., "turn around, pick up the cup on the table, and hand it over").
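The summary describes a joint objective combining contrastive audio-motion alignment with a kinematic loss, but does not give the formulation. The sketch below is a minimal, hypothetical reading of such an objective: an InfoNCE-style contrastive term over paired audio/motion embeddings plus a joint-limit penalty. All function names, the temperature value, and the joint-limit representation are illustrative assumptions, not the paper's actual losses.

```python
import numpy as np

def contrastive_loss(audio_emb, motion_emb, temperature=0.07):
    """InfoNCE-style loss aligning paired audio/motion embeddings.

    Row i of each matrix is assumed to be a positive pair;
    all other rows in the batch serve as in-batch negatives.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = a @ m.T / temperature                # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))    # diagonal = positive pairs

def kinematic_loss(joint_angles, lower, upper):
    """Squared penalty on joint angles outside plausible anatomical limits."""
    violation = np.maximum(lower - joint_angles, 0) + np.maximum(joint_angles - upper, 0)
    return float(np.mean(violation ** 2))
```

In a setup like this, the total training loss would be a weighted sum of the two terms, with the kinematic term acting as a physical-plausibility regularizer on the SMPL pose output.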

Problem

Research questions and friction points this paper is trying to address.

Generating human motions aligned with audio semantics
Overcoming weak audio-motion semantic connection in existing methods
Enhancing datasets with conversational audio for practical interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked generative transformer for motion generation
Memory-retrieval attention for sparse audio inputs
Enriched datasets with conversational audio descriptions
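The memory-retrieval attention bullet suggests that sparse audio features query a bank of learned motion memories to compensate for missing input signal. The paper's concrete design is not reproduced here, so the following is one plausible sketch using standard scaled dot-product attention over a memory bank; the function names and tensor shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_retrieval_attention(audio_feats, memory_keys, memory_values):
    """Retrieve motion priors from a learned memory bank.

    audio_feats:   (T, d)   sparse per-frame audio queries
    memory_keys:   (M, d)   keys of the motion memory bank
    memory_values: (M, d_v) stored motion representations
    Returns a (T, d_v) blend of memory entries per audio frame.
    """
    d = audio_feats.shape[-1]
    scores = audio_feats @ memory_keys.T / np.sqrt(d)  # (T, M)
    weights = softmax(scores, axis=-1)                 # attention over memories
    return weights @ memory_values                     # (T, d_v)
```

The design intuition is that when the audio stream is sparse or uninformative, the decoder can fall back on retrieved motion priors rather than hallucinating poses from weak input.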
🔎 Similar Papers
2023-11-29 · IEEE Transactions on Visualization and Computer Graphics · Citations: 0