Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in text-driven zero-shot human motion generation: limited generalization capability, insufficient training data scale, and the absence of standardized evaluation protocols. Methodologically, we introduce MotionMillion—the largest high-quality motion dataset to date (2 million sequences)—and propose MotionMillion-Eval, the first comprehensive benchmark specifically designed for zero-shot motion generation evaluation. Leveraging an efficient annotation pipeline and a scalable architecture, we train a 7B-parameter end-to-end generative model. Our contributions include substantial improvements in zero-shot generalization across domains and for complex compositional motions, achieving state-of-the-art performance on multiple novel benchmarks. Furthermore, we publicly release both the code and dataset, establishing a robust foundation and a unified evaluation standard for the community.

📝 Abstract
Generating diverse and natural human motion sequences from textual descriptions is a fundamental and challenging research problem spanning computer vision, graphics, and robotics. Despite significant advances, current methods struggle with zero-shot generalization, largely because of the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes progress by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era: achieving zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. We also propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
Problem

Research questions and friction points this paper is trying to address.

Achieving zero-shot generalization in text-to-motion generation
Addressing limited dataset size for training motion models
Developing a comprehensive benchmark for motion generation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

The largest human motion dataset to date, with 2 million sequences
A comprehensive benchmark for zero-shot evaluation
A scalable 7B-parameter model with strong zero-shot generalization
Ke Fan
Fudan University
Machine Learning, Deep Learning

Shunlin Lu
The Chinese University of Hong Kong, Shenzhen

Minyue Dai
Fudan University

Runyi Yu
HKUST

Lixing Xiao
Zhejiang University
Computer Vision, Deep Learning, Generative AI

Zhiyang Dou
HKU

Junting Dong
Zhejiang University
Computer Vision

Lizhuang Ma
Shanghai Jiao Tong University, East China Normal University

Jingbo Wang
Shanghai AI Laboratory