🤖 AI Summary
This work addresses semantic and temporal misalignment across heterogeneous multimodal inputs (text, music, speech) in whole-body motion generation. We propose OmniMotion-X, a unified multimodal framework supporting text-to-motion, music-to-dance, speech-to-gesture, and global spatiotemporal control. To resolve cross-modal conflicts, we introduce reference motion as a strong conditioning signal and train with a progressive weak-to-strong mixed-condition strategy. Our method employs an autoregressive diffusion transformer to jointly model motion prediction, in-betweening, completion, and trajectory-guided synthesis. Leveraging the SMPL-X format, we construct OmniMoCap-X, a large-scale multimodal motion-capture dataset with fine-grained hierarchical annotations generated by GPT-4o. Experiments demonstrate state-of-the-art performance across diverse tasks, significantly improving long-horizon motion consistency, content controllability, and cross-modal coherence. OmniMotion-X enables high-fidelity, interactive, and fine-grained controllable whole-body motion generation.
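To make the weak-to-strong idea concrete, here is a minimal Python sketch of one way such a progressive mixed-condition schedule could be implemented during training. The condition names, the linear ramp, and the keep-probabilities are illustrative assumptions, not the paper's exact recipe.

```python
import random

WEAK = ["text", "music", "speech"]                        # always-available conditions
STRONG = ["reference_motion", "trajectory", "keyframes"]  # gradually phased in

def sample_condition_mask(step: int, total_steps: int) -> dict:
    """Pick which conditioning signals to feed the model at this training step."""
    keep_strong_p = min(step / total_steps, 1.0)   # linear 0 -> 1 ramp (assumed)
    mask = {c: True for c in WEAK}                 # weak conditions always on
    for c in STRONG:
        mask[c] = random.random() < keep_strong_p  # strong conditions phased in
    return mask

# Early training sees almost no strong signals; late training sees nearly all,
# so the model first learns semantic alignment, then precise control.
print(sample_condition_mask(step=100, total_steps=100_000))     # mostly weak
print(sample_condition_mask(step=95_000, total_steps=100_000))  # mostly strong
```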
📝 Abstract
This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation that leverages an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatiotemporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose using reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics, which is crucial for realistic animation. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured, hierarchical captions capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, achieving state-of-the-art performance across multiple multimodal tasks and enabling interactive generation of realistic, coherent, and controllable long-duration motions.
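As a rough illustration of the autoregressive diffusion formulation, the sketch below samples a long motion chunk by chunk, feeding each generated chunk back as context for the next. The chunk length, feature dimension, step count, and the `denoise_step` interface are all assumptions; a stub stands in for the real transformer.

```python
import torch

CHUNK_LEN = 90   # frames per chunk at 30 fps (3 s; assumed chunk size)
POSE_DIM = 322   # flattened SMPL-X pose/translation features (assumed)
STEPS = 50       # diffusion denoising steps per chunk (assumed)

class DenoiserStub(torch.nn.Module):
    """Stand-in for the autoregressive diffusion transformer."""
    def denoise_step(self, x, t, conditions, context):
        # A real model would attend over the multimodal conditions
        # (text/music/speech embeddings) and the previous chunk; here
        # we simply shrink the noise so the sketch runs end to end.
        return 0.95 * x

@torch.no_grad()
def generate(model, conditions, num_chunks):
    """Sample a long motion chunk by chunk, passing each chunk on as context."""
    context = torch.zeros(1, CHUNK_LEN, POSE_DIM)  # empty history for chunk 0
    out = []
    for _ in range(num_chunks):
        x = torch.randn(1, CHUNK_LEN, POSE_DIM)    # start from Gaussian noise
        for t in reversed(range(STEPS)):
            x = model.denoise_step(x, t, conditions, context)
        out.append(x)
        context = x                                # autoregressive handoff
    return torch.cat(out, dim=1)                   # (1, num_chunks*CHUNK_LEN, POSE_DIM)

motion = generate(DenoiserStub(), {"text": "a person waves"}, num_chunks=4)
```

Generating chunk by chunk keeps memory bounded and lets later chunks react to earlier output, which is what makes long-duration, interactive generation tractable in this kind of design.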