🤖 AI Summary
To address weak semantic alignment and physically implausible motions in audio-driven human pose generation, this paper proposes a semantics-aware cross-modal alignment framework. It explicitly models fine-grained semantics extracted from audio instructions—verbs, nouns, and spatial relations—and enforces kinematic constraints to ensure physical plausibility. The authors design an end-to-end model that integrates Whisper for speech encoding, a semantics-enhanced Transformer decoder, and a differentiable SMPL pose regression module, jointly optimized with a contrastive loss and a kinematic loss. On the Audio-to-Pose benchmark, the method improves pose accuracy by 18.7% and yields significant gains in BLEU-4 score and action-verb recall. Notably, the authors claim it is the first method to generate semantically consistent and temporally coherent 3D motions for complex multi-step instructions (e.g., "turn around, pick up the cup on the table, and hand it over").
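The summary does not spell out the joint objective, but a combined "contrastive + kinematic" loss is commonly an InfoNCE-style alignment term between audio and pose embeddings plus a joint-limit penalty. Below is a minimal NumPy sketch under that assumption; the function names, the joint-limit form of the kinematic loss, and the weighting `lam` are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(audio_emb, pose_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: matched audio/pose pairs
    sit on the diagonal of the similarity matrix (assumed form)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched pair) as the target
    return -np.mean(np.diag(log_probs))

def kinematic_penalty(joint_angles, lower, upper):
    """Quadratic penalty for joint angles outside plausible limits
    (one simple way to encode a kinematic constraint)."""
    below = np.clip(lower - joint_angles, 0.0, None)
    above = np.clip(joint_angles - upper, 0.0, None)
    return np.mean(below**2 + above**2)

def total_loss(audio_emb, pose_emb, joint_angles, lower, upper, lam=0.1):
    # lam is a hypothetical weighting between the two terms
    return info_nce(audio_emb, pose_emb) + lam * kinematic_penalty(
        joint_angles, lower, upper)
```

In practice both terms would be computed on batched model outputs and backpropagated jointly; the sketch only shows the shape of the objective, with the contrastive term pulling matched audio/pose embeddings together and the kinematic term discouraging physically implausible joint configurations.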