ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses missing fine-grained detail and structural inconsistency in text-driven human motion generation by proposing a scale-wise autoregressive framework that treats motion synthesis as coarse-to-fine, next-scale prediction. The method introduces a multi-scale discrete representation that preserves the skeletal hierarchy at every scale and combines bitwise quantization with autoregressive scale prediction. The same skeletal-temporal multi-scale representation also supports training-free, text-guided motion editing. On HumanML3D, the approach reports a state-of-the-art FID of 0.030 (vs. 0.045 for MoMask); on SnapMoGen, it reports a CLIP Score of 0.693 (vs. 0.685 for MoMask++).
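To make the coarse-to-fine idea concrete, here is a minimal sketch of a residual multi-scale decomposition along the time axis only. It is an illustration under stated assumptions, not the paper's tokenizer: the helper names (`downsample`, `upsample`, `build_scales`), the average-pool and nearest-neighbour resampling, and the scale lengths are all hypothetical, and the sketch omits both quantization and the skeletal dimension.

```python
def downsample(x, n):
    """Average-pool a list of floats to length n (len(x) must be divisible by n)."""
    step = len(x) // n
    return [sum(x[i * step:(i + 1) * step]) / step for i in range(n)]


def upsample(x, n):
    """Nearest-neighbour upsample a list to length n (n must be divisible by len(x))."""
    rep = n // len(x)
    return [v for v in x for _ in range(rep)]


def build_scales(signal, lengths):
    """Split a signal into residual maps, ordered coarse to fine.

    Each scale stores what the coarser scales have not yet explained,
    so summing the upsampled maps progressively refines the reconstruction.
    """
    total = len(signal)
    recon = [0.0] * total
    scales = []
    for n in lengths:
        residual = [s - r for s, r in zip(signal, recon)]
        coarse = downsample(residual, n)      # hypothetical stand-in for a token map
        scales.append(coarse)
        recon = [r + u for r, u in zip(recon, upsample(coarse, total))]
    return scales, recon


# Toy one-channel "motion" curve with 8 frames.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
scales, recon = build_scales(signal, [1, 2, 4, 8])
```

In a next-scale autoregressive model, each of these maps would be quantized into discrete tokens, and a generator would predict the map at scale k from the text prompt and the maps at scales 1 to k-1, rather than predicting one token at a time.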
📝 Abstract
We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
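The abstract does not spell out the quantizer, so the following is only a sketch of one common form of bitwise quantization (sign-based, lookup-free), offered to show why it scales the vocabulary cheaply. The function names `bitwise_quantize` and `bits_to_index`, the 4-dimensional latent, and the unit-norm code scaling are assumptions made for illustration, not details taken from the paper.

```python
import math


def bitwise_quantize(latent):
    """Sign-quantize a latent vector (assumed scheme, not the paper's).

    Returns the bit pattern and the matching code vector, whose entries
    are +/- 1/sqrt(d) so that the code has unit norm.
    """
    d = len(latent)
    scale = 1.0 / math.sqrt(d)
    bits = [1 if v > 0 else 0 for v in latent]
    codes = [scale if b else -scale for b in bits]
    return bits, codes


def bits_to_index(bits):
    """Pack a bit list into an integer token id, most significant bit first."""
    idx = 0
    for b in bits:
        idx = (idx << 1) | b
    return idx


latent = [0.7, -0.2, 0.1, -1.3]          # toy 4-dimensional latent
bits, codes = bitwise_quantize(latent)
token_id = bits_to_index(bits)
vocab_size = 2 ** len(latent)            # implicit vocabulary grows as 2^d
```

With d bits per token the implicit vocabulary has 2^d entries without storing a codebook, and a bitwise predictor can output d binary decisions instead of a softmax over 2^d classes, which is one plausible reading of how such a scheme keeps optimization stable as the vocabulary grows.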
Problem

Research questions and friction points this paper is trying to address.

human motion generation
text-driven animation
multi-scale representation
3D motion modeling
skeletal hierarchy
Innovation

Methods, ideas, or system contributions that make the work stand out.

scale-wise autoregressive
multi-scale motion representation
discrete motion tokenization
bitwise quantization
skeletal hierarchy preservation