The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current 3D human motion generation models suffer from limited generalization, whereas video generation models generalize far better in modeling human behaviors. To bridge this gap, we propose ViMoGen, a novel framework that, for the first time, systematically transfers knowledge from video generation to motion generation. Our approach introduces (1) ViMoGen-228K, a large-scale multimodal dataset comprising text-motion pairs and text-video-motion triplets; (2) ViMoGen, a flow-matching-based diffusion Transformer with a gated multimodal conditioning mechanism, along with its lightweight distilled variant ViMoGen-light; and (3) MBench, a hierarchical benchmark for comprehensive motion assessment. Extensive experiments demonstrate that ViMoGen significantly outperforms state-of-the-art methods in motion quality, prompt fidelity, and cross-scenario generalization, achieving leading performance in both automated and human evaluations. To foster reproducibility and community advancement, we will open-source our code, dataset, and benchmark.
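The summary names a gated multimodal conditioning mechanism without detailing it. Below is a minimal sketch of what such a gate could look like, assuming learned per-channel sigmoid gating between projected text and video feature streams; the module, names, and dimensions are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioner(nn.Module):
    """Hypothetical gated fusion of text and video conditioning features."""
    def __init__(self, text_dim: int, video_dim: int, cond_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.video_proj = nn.Linear(video_dim, cond_dim)
        # The gate decides, per channel, how much to rely on each modality,
        # e.g. leaning on the text prior when the video prior is uninformative.
        self.gate = nn.Sequential(nn.Linear(2 * cond_dim, cond_dim), nn.Sigmoid())

    def forward(self, text_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_feat)     # (B, cond_dim)
        v = self.video_proj(video_feat)   # (B, cond_dim)
        g = self.gate(torch.cat([t, v], dim=-1))
        return g * t + (1.0 - g) * v      # gated blend of the two priors

cond = GatedMultimodalConditioner(text_dim=768, video_dim=1024, cond_dim=512)
fused = cond(torch.randn(4, 768), torch.randn(4, 1024))  # -> (4, 512)
```

A convex blend like this lets the model interpolate between modalities per feature channel; the paper's actual gating design may differ.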

πŸ“ Abstract
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
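
The abstract specifies flow matching as the generative objective. Here is a minimal sketch of one conditional flow-matching training step on motion tensors, assuming the common linear-interpolation (rectified-flow) formulation; TinyVelocityNet and all shapes are illustrative stand-ins, not ViMoGen's diffusion transformer.

```python
import torch
import torch.nn.functional as F

class TinyVelocityNet(torch.nn.Module):
    """Illustrative stand-in for the velocity-prediction backbone."""
    def __init__(self, motion_dim: int, cond_dim: int):
        super().__init__()
        self.net = torch.nn.Linear(motion_dim + 1 + cond_dim, motion_dim)

    def forward(self, x_t, t, cond):
        B, T, D = x_t.shape
        t_feat = t.view(B, 1, 1).expand(B, T, 1)                # broadcast time
        c_feat = cond.unsqueeze(1).expand(B, T, cond.size(-1))  # broadcast condition
        return self.net(torch.cat([x_t, t_feat, c_feat], dim=-1))

def flow_matching_step(model, motion, cond, optimizer):
    """One training step: motion is (B, T, D) clean data, cond is (B, C)."""
    noise = torch.randn_like(motion)                  # x_0 ~ N(0, I)
    t = torch.rand(motion.size(0), device=motion.device)
    x_t = (1 - t).view(-1, 1, 1) * noise + t.view(-1, 1, 1) * motion
    target_v = motion - noise                         # constant velocity along the path
    loss = F.mse_loss(model(x_t, t, cond), target_v)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyVelocityNet(motion_dim=156, cond_dim=512)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = flow_matching_step(model, torch.randn(4, 120, 156), torch.randn(4, 512), opt)
```

At inference time, such a model would integrate the learned velocity field from noise to data, e.g. with a simple Euler solver over a handful of steps.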
Problem

Research questions and friction points this paper is trying to address.

Addressing generalization bottleneck in 3D human motion generation
Transferring video generation insights to motion generation systems
Developing a unified framework spanning data, modeling, and evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset integrates optical motion capture with web-video and ViGen-synthesized motions (a sample record is sketched after this list)
Flow-matching diffusion transformer unifies multimodal priors via gated conditioning
Hierarchical benchmark evaluates motion quality, prompt fidelity, and generalization
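As referenced in the dataset bullet above, a hypothetical record layout for a ViMoGen-228K-style sample is shown below; the abstract does not specify a storage format, so every field name and dimension here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionSample:
    """Hypothetical sample schema; field names are assumptions, not the paper's."""
    text: str                         # prompt or semantic annotation
    motion: np.ndarray                # (T, D) pose parameters per frame
    video_path: Optional[str] = None  # set only for text-video-motion triplets
    source: str = "mocap"             # "mocap" | "web_video" | "vigen_synth"

sample = MotionSample(
    text="a person jumps over a low obstacle",
    motion=np.zeros((120, 156), dtype=np.float32),  # illustrative dimensions
)
```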
👥 Authors
Jing Lin (Nanyang Technological University)
Ruisi Wang (SenseTime Research)
Junzhe Lu (Tsinghua University) · Computer Vision, Generative Modeling
Ziqi Huang (Ph.D. Student, MMLab@NTU, Nanyang Technological University) · Computer Vision
Guorui Song (Tsinghua University)
Ailing Zeng (Anuttacon) · Deep Learning, Computer Vision, Virtual Humans, Video Generation
Xian Liu (NVIDIA Research)
Chen Wei (SenseTime Research)
Wanqi Yin (SenseTime Research) · Computer Vision, Motion Capture, Digital Human
Qingping Sun (SenseTime Research)
Zhongang Cai (SenseTime Research)
Lei Yang (SenseTime Research)
Ziwei Liu (Associate Professor, Nanyang Technological University) · Computer Vision, Machine Learning, Computer Graphics