OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses semantic and temporal misalignment across heterogeneous multimodal inputs (text, music, speech) in whole-body motion generation. It proposes OmniMotion-X, the first unified multimodal framework supporting text-to-motion, music-to-dance, speech-to-gesture, and global spatiotemporal control. To resolve cross-modal conflicts, the authors introduce reference motion as a strong conditioning signal and design a progressive weak-to-strong mixed-condition training strategy. The method employs an autoregressive diffusion transformer to jointly model motion prediction, completion, and guided synthesis. Building on the SMPL-X representation, the authors construct OmniMoCap-X, a large-scale multimodal motion-capture dataset with fine-grained hierarchical annotations generated by GPT-4o. Experiments demonstrate state-of-the-art performance across diverse tasks, with significant gains in long-horizon motion consistency, content controllability, and cross-modal coherence. OmniMotion-X enables high-fidelity, interactive, and finely controllable whole-body motion generation.
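The reference-motion conditioning signal lends itself to a compact illustration. Below is a minimal PyTorch sketch of how a reference clip might be compressed into a fixed set of condition tokens for the generator; the module name, the 322-dimensional SMPL-X-style feature size, and the token count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ReferenceMotionEncoder(nn.Module):
    """Hypothetical encoder: compresses a reference motion clip into a
    fixed number of condition tokens to prepend to the generator input."""
    def __init__(self, motion_dim=322, d_model=512, n_tokens=16):
        super().__init__()
        self.proj = nn.Linear(motion_dim, d_model)
        self.pool = nn.AdaptiveAvgPool1d(n_tokens)  # compress the time axis

    def forward(self, ref_motion):
        # ref_motion: (batch, frames, motion_dim), e.g. SMPL-X-style features at 30 fps
        x = self.proj(ref_motion)           # (B, T, d_model)
        x = self.pool(x.transpose(1, 2))    # (B, d_model, n_tokens)
        return x.transpose(1, 2)            # (B, n_tokens, d_model)

encoder = ReferenceMotionEncoder()
ref = torch.randn(2, 90, 322)               # 3 s of motion at 30 fps (assumed feature size)
print(encoder(ref).shape)                   # torch.Size([2, 16, 512])
```

In this layout, the resulting tokens would sit alongside text, music, or speech embeddings in the condition stream, which is one plausible way a reference clip can steer content, style, and temporal dynamics.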

📝 Abstract
This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
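To make the unified sequence-to-sequence use of an autoregressive diffusion transformer concrete, here is a minimal sampling-loop sketch: motion is generated chunk by chunk, and each chunk is iteratively denoised while attending to multimodal condition tokens plus all previously generated chunks. Every module name, layer size, and the crude refinement loop are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """Toy diffusion-transformer denoiser standing in for the real model."""
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.t_embed = nn.Embedding(1000, d_model)   # diffusion-step embedding

    def forward(self, noisy_chunk, context, t):
        # Condition tokens (text/music/speech/reference/history) are prepended
        # to the noisy motion chunk so self-attention can mix them.
        seq = torch.cat([context, noisy_chunk + self.t_embed(t)[:, None]], dim=1)
        out = self.backbone(seq)
        return out[:, context.size(1):]              # predicted clean chunk

@torch.no_grad()
def generate(denoiser, cond_tokens, n_chunks=4, chunk_len=30, d_model=512, steps=10):
    """Autoregressive chunk-by-chunk sampling: each chunk is denoised while
    attending to the conditions and to previously generated chunks."""
    history = cond_tokens                            # multimodal condition tokens
    chunks = []
    for _ in range(n_chunks):
        x = torch.randn(cond_tokens.size(0), chunk_len, d_model)
        for step in reversed(range(steps)):          # crude iterative refinement
            t = torch.full((x.size(0),), step * 100, dtype=torch.long)
            x = denoiser(x, history, t)
        chunks.append(x)
        history = torch.cat([history, x], dim=1)     # feed chunk back as context
    return torch.cat(chunks, dim=1)

model = DenoiserBlock()
conds = torch.randn(1, 16, 512)
print(generate(model, conds).shape)                  # torch.Size([1, 120, 512])
```

Feeding generated chunks back into the context is what makes long-horizon consistency plausible in this style of model: later chunks attend directly to earlier ones rather than being stitched together afterward.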
Problem

Research questions and friction points this paper is trying to address.

Generating versatile whole-body human motions from diverse multimodal inputs
Enhancing motion consistency and realism through novel conditioning signals
Resolving multimodal conflicts within a unified motion generation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive diffusion transformer for unified motion generation
Reference motion conditioning for enhanced content consistency
Progressive weak-to-strong mixed-condition training strategy (see the sketch below)
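The progressive weak-to-strong mixed-condition strategy can be pictured as a curriculum over condition sets during training: early steps rely mostly on weak conditions such as text, while strong conditions such as reference motion or joint trajectories are mixed in with increasing probability. The sketch below only illustrates the general shape of such a schedule; the specific probabilities and the weak/strong ordering are assumptions, not the paper's recipe.

```python
import random

def condition_schedule(step, total_steps):
    """Hypothetical weak-to-strong curriculum: returns the set of conditions
    to train on at a given step. Probabilities are illustrative only."""
    progress = step / total_steps
    p_strong = min(1.0, 1.5 * progress)      # ramp strong conditions up over time
    conds = ["text"]                         # weak condition, always available
    if random.random() < p_strong:
        conds.append(random.choice(["reference_motion", "trajectory", "joints"]))
    if random.random() < 0.5:                # optionally mix in an audio stream
        conds.append(random.choice(["music", "speech"]))
    return conds

for step in (0, 5000, 9999):
    print(step, condition_schedule(step, 10000))
```

Ramping the strong conditions in gradually gives the model time to learn the weak-condition mappings first, which is one plausible way to keep conflicting modalities from destabilizing early training.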