🤖 AI Summary
Existing text-to-motion generation methods commonly adopt a uniform human body model, neglecting the natural influence of body shape on motion dynamics and thereby producing kinematically implausible motions. To address this, we propose the first shape-aware text-driven motion synthesis framework. Our approach explicitly incorporates continuous body shape parameters (e.g., SMPL β) into the generative pipeline: motion sequences are discretized via an FSQ-VAE; a single language model jointly predicts shape parameters and motion tokens conditioned on text; and motion is decoded under explicit shape conditioning, enabling learnable shape–motion associations. Quantitative evaluation, qualitative analysis, and user studies on AMASS and HumanML3D demonstrate that our method significantly improves motion plausibility and shape–motion consistency, establishing new state-of-the-art performance in shape-aware text-to-motion generation.
📝 Abstract
We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
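To make the quantization step concrete, below is a minimal plain-Python sketch of finite scalar quantization (FSQ) as used in FSQ-VAEs: each latent dimension is bounded with `tanh` and rounded to a small integer grid, and the per-dimension codes are packed into a single token index. This is an illustrative simplification (odd level counts only, straight-through gradients and the motion encoder/decoder omitted), not the paper's implementation; the function names and the example level sizes are our own.

```python
import math

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch: bound each latent dimension
    with tanh scaled to (-(L-1)/2, (L-1)/2), then round to the nearest
    integer code. Assumes each L in `levels` is odd so the rounded codes
    span exactly L symmetric values. Straight-through gradient omitted."""
    codes = []
    for zi, L in zip(z, levels):
        half = (L - 1) / 2.0
        bounded = half * math.tanh(zi)      # lies in (-half, half)
        codes.append(int(round(bounded)))   # one of L integer codes
    return codes

def code_to_index(codes, levels):
    """Pack per-dimension codes into a single discrete token index via
    mixed-radix encoding over the level sizes (the implied codebook has
    prod(levels) entries, with no learned codebook to collapse)."""
    idx = 0
    for c, L in zip(codes, levels):
        idx = idx * L + (c + (L - 1) // 2)  # shift code to [0, L-1]
    return idx

# Example: a 3-dim latent with 5 levels per dimension -> 125 tokens.
codes = fsq_quantize([10.0, -10.0, 0.0], [5, 5, 5])
token = code_to_index(codes, [5, 5, 5])
```

In the generation pipeline described above, such token indices would form the discrete motion vocabulary the language model predicts, while the continuous shape parameters condition the decoder that maps tokens back to motion.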