Motion Anything: Any to Motion Generation

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper tackles two key challenges in conditional motion generation: insufficient prioritization of dynamic keyframes and body parts, and inadequate fusion of multimodal conditions. It proposes Motion Anything, an attention-driven masked autoregressive framework in which a spatiotemporal fine-grained attention masking mechanism focuses adaptively on dynamic keyframes and key joints, and a multimodal conditional adaptive encoder unifies heterogeneous inputs such as text and music into a coherent latent representation. The paper also introduces Text-Motion-Dance (TMD), a new aligned dataset of 2,153 text-music-dance triplets, roughly twice the size of AIST++. Quantitative evaluation shows a 15% improvement in FID on HumanML3D, along with consistent gains over state-of-the-art methods on AIST++ and the new TMD benchmark; the authors position this as the first work to enable high-fidelity, cross-modal controllable 3D human motion generation.
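The summary describes condition-aware masking only at a high level. As a rough illustration of the idea, and not the authors' published implementation, the sketch below shows one way temporal masking could be driven by cross-attention between motion frames and condition tokens, so the most condition-relevant frames are the ones masked and predicted; the function name, tensor shapes, and top-k selection are all assumptions.

```python
# Minimal sketch of condition-aware temporal masking (illustrative only, not
# the paper's exact mechanism): frames that attend most strongly to the
# condition tokens are masked first, so the masked autoregressive decoder is
# trained to reconstruct the "key" dynamic frames.
import torch
import torch.nn.functional as F

def attention_guided_mask(motion_tokens, cond_tokens, mask_ratio=0.5):
    """motion_tokens: (T, D) per-frame motion embeddings.
    cond_tokens:   (L, D) text/music condition embeddings.
    Returns a boolean mask of shape (T,), where True = masked (to be predicted)."""
    # Cross-attention scores between each frame and the condition sequence.
    scores = motion_tokens @ cond_tokens.T / motion_tokens.shape[-1] ** 0.5  # (T, L)
    frame_saliency = F.softmax(scores, dim=-1).max(dim=-1).values            # (T,)

    # Mask the top-k most condition-relevant frames.
    k = int(mask_ratio * motion_tokens.shape[0])
    mask = torch.zeros(motion_tokens.shape[0], dtype=torch.bool)
    mask[frame_saliency.topk(k).indices] = True
    return mask

# Usage with random tensors standing in for real embeddings.
motion = torch.randn(64, 256)   # 64 frames
cond = torch.randn(12, 256)     # 12 condition tokens (e.g. a text prompt)
mask = attention_guided_mask(motion, cond, mask_ratio=0.4)
```

The same saliency idea could in principle be applied along the spatial axis (per-joint tokens) to obtain the fine-grained body-part control the paper describes.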

📝 Abstract
Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website https://steve-zeyu-zhang.github.io/MotionAnything

Problem

Research questions and friction points this paper is trying to address.

Lack of a mechanism in masked motion models to prioritize dynamic frames and body parts based on the given conditions.
Ineffective integration of multiple conditioning modalities in motion generation.
Absence of a comprehensive dataset for text, music, and dance motion.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based Mask Modeling for motion control
Multimodal condition encoding for improved controllability (see the sketch after this list)
Text-Motion-Dance dataset with 2,153 text-music-dance pairs
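As a rough sketch of what the multimodal condition encoding listed above could look like in practice, text and music features can be projected into a shared latent space and fused with a learned gate. The module names, feature dimensions, and gating scheme below are assumptions for illustration, not the paper's published architecture.

```python
# Hedged sketch of a multimodal condition encoder: project each modality into a
# shared latent space, then fuse with a learned per-channel gate.
import torch
import torch.nn as nn

class MultimodalConditionEncoder(nn.Module):
    def __init__(self, text_dim=768, music_dim=128, latent_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.music_proj = nn.Linear(music_dim, latent_dim)
        # Gate decides, per latent channel, how much each modality contributes.
        self.gate = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.Sigmoid())

    def forward(self, text_feat=None, music_feat=None):
        # Either modality may be absent; at least one must be provided.
        parts = []
        if text_feat is not None:
            parts.append(self.text_proj(text_feat))
        if music_feat is not None:
            parts.append(self.music_proj(music_feat))
        if not parts:
            raise ValueError("at least one condition is required")
        if len(parts) == 1:
            return parts[0]
        g = self.gate(torch.cat(parts, dim=-1))
        return g * parts[0] + (1 - g) * parts[1]

enc = MultimodalConditionEncoder()
cond = enc(text_feat=torch.randn(1, 768), music_feat=torch.randn(1, 128))  # (1, 256)
```

A gated fusion like this allows either modality to be supplied alone or both together, which matches the "any to motion" framing of conditioning on text, music, or their combination.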