AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

📅 2026-05-28
🤖 AI Summary
Existing multimodal human motion generation methods are constrained by fixed modality configurations and task-specific architectures, which limits their generalization to arbitrary modality combinations; they are further held back by the scarcity of large-scale modality-aligned data. This work proposes AnyMo, a unified framework that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling Transformer, enabling high-fidelity motion generation conditioned on arbitrary modality combinations with flexible control over spatial and stylistic attributes. To support it, the work introduces OmniHuMo, a large-scale multimodal aligned motion dataset comprising over 5,000 hours of motion and 3.2 million sequences with aligned annotations (e.g., text, speech, music, and trajectory). The work also explores scaling laws in multimodal conditional synthesis within this framework, targeting better cross-modal generalization and generation quality.
📝 Abstract
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
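The abstract names a Residual FSQ-based motion tokenizer but gives no details of it. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below shows residual finite scalar quantization in NumPy: each stage rounds the remaining residual onto a small fixed grid, and later stages work at a finer scale. The function names (`fsq`, `residual_fsq`), the odd level counts, the per-stage scale `(levels - 1) ** -stage`, and the assumption that latents lie in [-1, 1] are all choices made for this example.

```python
import numpy as np

def fsq(x, levels):
    """Finite scalar quantization (assumes odd level counts).

    Clips each channel to [-1, 1] and rounds it onto `levels` evenly
    spaced grid points, returning the quantized values in [-1, 1].
    """
    half = (levels - 1) / 2.0              # e.g. 5 levels -> integer codes in {-2..2}
    return np.round(np.clip(x, -1.0, 1.0) * half) / half

def residual_fsq(z, levels, num_stages):
    """Quantize `z` in several stages, each stage coding the previous residual.

    Hypothetical helper: returns the reconstruction and the per-stage
    integer codes. `z` has shape (frames, channels); `levels` has one
    entry per channel.
    """
    levels = np.asarray(levels, dtype=np.float64)
    half = (levels - 1) / 2.0
    residual = np.asarray(z, dtype=np.float64)
    recon = np.zeros_like(residual)
    codes = []
    for stage in range(num_stages):
        # Each stage uses a grid (levels - 1) times finer than the last.
        scale = (levels - 1.0) ** (-stage)
        q_unit = fsq(residual / scale, levels)     # quantized, in [-1, 1]
        codes.append(np.round(q_unit * half).astype(int))
        q = q_unit * scale
        recon = recon + q
        residual = residual - q
    return recon, codes

# Toy "motion latent": 8 frames, 4 channels, values in [-1, 1].
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(8, 4))
recon_1, _ = residual_fsq(z, [5, 5, 5, 5], num_stages=1)
recon_3, codes = residual_fsq(z, [5, 5, 5, 5], num_stages=3)
print(np.abs(z - recon_1).max(), np.abs(z - recon_3).max())
```

Under these assumptions the worst-case per-channel error shrinks by a factor of `levels - 1` with every added stage. The per-stage integer codes are the kind of discrete tokens a masked modeling transformer could be trained to predict.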
Problem

Research questions and friction points this paper is trying to address.

conditional motion generation
multimodal synthesis
cross-modal interaction
modality-aligned data
scaling laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked modeling
multimodal motion generation
motion tokenizer
scalable transformer
modality-aligned dataset
Yiheng Li
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Zhuo Li
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China
Ruibing Hou
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Deep Learning
Yingjie Chen
Independent Author
Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning, Computer Vision, Pattern Recognition
Hao Liu
Independent Author
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition