🤖 AI Summary
This work addresses the challenge of generating full-body human motion from multimodal inputs—text, speech, and music. We propose the Continuous Masked Autoregressive Motion Transformer (CMAR-MoT), a unified architecture that models temporal dynamics via causal attention, mitigates multimodal distribution shifts using gated linear attention and RMSNorm, and performs conditional diffusion with a DiT backbone modulated by AdaLN. Cross-attention enables seamless integration of heterogeneous conditioning signals. To our knowledge, CMAR-MoT is the first framework supporting joint generation across diverse tasks—including text-to-motion, speech-to-gesture, and music-to-dance—within a single model. Quantitative and qualitative evaluations demonstrate significant improvements over prior methods in motion naturalness, temporal coherence, and cross-modal generalization. The code will be made publicly available.
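As a rough illustration of the "gated linear attention + RMSNorm" components mentioned above, the sketch below pairs an RMSNorm pre-norm layer with a causal gated linear attention operator, in which a data-dependent sigmoid gate decays a running key-value state so salient frames are emphasized and outlier frames are damped. All module names, dimensions, and the specific gating form are assumptions made for illustration; the paper's exact operator may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang & Sennrich, 2019)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GatedLinearAttention(nn.Module):
    """Causal gated linear attention (illustrative simplification):
    a per-channel sigmoid gate decays the running k^T v state over time."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_gate = nn.Linear(dim, dim)   # per-channel forget gate
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, T, D) motion tokens
        q = self.to_q(x)
        k = F.elu(self.to_k(x)) + 1.0        # positive feature map for linear attention
        v = self.to_v(x)
        g = torch.sigmoid(self.to_gate(x))   # gates in (0, 1)

        B, T, D = x.shape
        state = x.new_zeros(B, D, D)         # running sum of gated k^T v outer products
        outs = []
        for t in range(T):                   # strictly causal recurrence over time
            state = g[:, t].unsqueeze(-1) * state \
                  + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
            outs.append(torch.einsum('bd,bde->be', q[:, t], state))
        return self.out(torch.stack(outs, dim=1))

class CausalMotionLayer(nn.Module):
    """Pre-norm residual layer combining RMSNorm with gated linear attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.attn = GatedLinearAttention(dim)

    def forward(self, x):
        return x + self.attn(self.norm(x))

# e.g. a batch of 2 sequences of 120 motion frames with 256-d features
layer = CausalMotionLayer(256)
print(layer(torch.randn(2, 120, 256)).shape)   # torch.Size([2, 120, 256])
```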
📝 Abstract
Whole-body multi-modal human motion generation poses two primary challenges: designing an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods, which typically rely on discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer in which causal attention is applied to respect the sequential nature of human motion. Within this transformer, we introduce gated linear attention and an RMSNorm module, which drive the model to focus on key actions and suppress the instability caused by abnormal movements or the heterogeneous distributions of the different modalities. To further enhance both motion generation and multimodal generalization, we employ a DiT structure to diffuse the conditions produced by the transformer toward the target motions. To fuse the different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all three tasks: text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made publicly available.
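To make the conditioning pathway concrete, the sketch below shows one DiT-style block in which a global condition (e.g. the diffusion timestep embedding) modulates the block via AdaLN-predicted shift, scale, and gate parameters, while cross-attention injects modality tokens (encoded text, speech, or music). This is a minimal sketch under assumed layer sizes, ordering, and names; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalDiTBlock(nn.Module):
    """DiT-style block: AdaLN modulation from a global condition plus
    cross-attention over modality tokens (illustrative, not the exact design)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # AdaLN: the condition predicts per-block shift, scale, and residual gate.
        self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, cond, mod_tokens, causal_mask=None):
        # x: (B, T, D) noisy motion latents; cond: (B, D); mod_tokens: (B, S, D)
        s1, b1, g1, s2, b2, g2 = self.adaln(cond).chunk(6, dim=-1)

        # AdaLN-modulated causal self-attention over the motion sequence
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, attn_mask=causal_mask)[0]

        # cross-attention fuses text / speech / music tokens into the motion stream
        h = self.norm2(x)
        x = x + self.cross_attn(h, mod_tokens, mod_tokens)[0]

        # AdaLN-modulated feed-forward
        h = self.norm3(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)

block = MultimodalDiTBlock(256)
x = torch.randn(2, 120, 256)       # noisy motion latents
cond = torch.randn(2, 256)         # timestep / global condition embedding
mod = torch.randn(2, 40, 256)      # e.g. encoded text, speech, or music tokens
print(block(x, cond, mod).shape)   # torch.Size([2, 120, 256])
```

The gating terms (g1, g2) follow the zero-initializable residual gates used in DiT's adaLN-Zero variant; how the actual model combines AdaLN with cross-attention across the three modalities is an assumption here.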