Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-3D human motion generation faces two key challenges: poor generalization to out-of-distribution motions, and coarse-grained control that prevents frame-level precision. To address these, the authors propose MoMADiff, a framework that combines masked autoregressive modeling with diffusion over continuous frame-level motion representations, sidestepping the discrete-token bottleneck of VQVAE-based methods. Users can supply sparse keyframes as motion prompts, enabling fine-grained control over both the spatial and temporal aspects of the synthesized motion. Evaluated on two held-out text-to-motion datasets with sparse keyframe prompts and two standard benchmarks, MoMADiff achieves state-of-the-art motion quality, text-instruction fidelity, and keyframe adherence.

📝 Abstract
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence.
Problem

Research questions and friction points this paper is trying to address.

Robustly generating diverse 3D human motion from text descriptions
Overcoming limitations of discrete tokens in representing novel motions
Achieving fine-grained control over motion synthesis frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines masked autoregressive modeling with a diffusion process
Operates on frame-level continuous representations rather than discrete tokens
Supports flexible user-provided keyframe specification (see the sketch below)
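To make these pieces concrete, here is a minimal sketch of keyframe-conditioned masked autoregressive diffusion. It is not the authors' implementation: it assumes a DDPM-style MLP head that denoises each frame conditioned on a bidirectional transformer's output, simple additive text conditioning, a linear noise schedule, and a random reveal order (the real model would more likely use a learned schedule and confidence-based ordering). All names (`MoMADiffSketch`, `DiffusionHead`, `frame_dim=263`, etc.) are hypothetical.

```python
# Hedged sketch: masked autoregressive diffusion over continuous motion frames.
# User keyframes stay fixed; masked frames are filled over several rounds,
# each frame sampled by a small per-frame diffusion head.
import math
import torch
import torch.nn as nn


class DiffusionHead(nn.Module):
    """MLP that predicts the noise in one noisy frame given its context."""

    def __init__(self, frame_dim: int, ctx_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + ctx_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, x_t, t, ctx):
        # x_t: (B, T, D) noisy frames; t: (B, T, 1) normalized step; ctx: (B, T, C).
        return self.net(torch.cat([x_t, t, ctx], dim=-1))


class MoMADiffSketch(nn.Module):
    def __init__(self, frame_dim=263, ctx_dim=256, n_layers=4, n_steps=50):
        super().__init__()
        self.n_steps = n_steps
        self.frame_in = nn.Linear(frame_dim, ctx_dim)
        self.mask_token = nn.Parameter(torch.zeros(ctx_dim))
        layer = nn.TransformerEncoderLayer(ctx_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = DiffusionHead(frame_dim, ctx_dim)
        betas = torch.linspace(1e-4, 0.02, n_steps)  # assumed linear schedule
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))

    def _context(self, frames, known, text_emb):
        # Embed known frames; masked slots get a learned mask token; add text.
        h = self.frame_in(frames)
        h = torch.where(known.unsqueeze(-1), h, self.mask_token.expand_as(h))
        return self.backbone(h + text_emb.unsqueeze(1))

    @torch.no_grad()
    def _sample_frames(self, ctx, frame_dim):
        # DDPM ancestral sampling of every frame, conditioned on its context.
        x = torch.randn(ctx.shape[0], ctx.shape[1], frame_dim, device=ctx.device)
        for step in reversed(range(self.n_steps)):
            t = torch.full((*ctx.shape[:2], 1), step / self.n_steps, device=ctx.device)
            eps = self.head(x, t, ctx)
            beta, a_bar = self.betas[step], self.alphas_bar[step]
            x = (x - beta / torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(1.0 - beta)
            if step > 0:
                x = x + torch.sqrt(beta) * torch.randn_like(x)
        return x

    @torch.no_grad()
    def generate(self, keyframes, known, text_emb, n_rounds=8):
        """Fill masked frames over n_rounds; user keyframes are never overwritten."""
        frames, known = keyframes.clone(), known.clone()
        T = frames.shape[1]
        for r in range(n_rounds):
            ctx = self._context(frames, known, text_emb)
            proposal = self._sample_frames(ctx, frames.shape[-1])
            # Cosine schedule: commit a growing number of frames per round.
            target = int(T * math.cos(math.pi / 2 * (1.0 - (r + 1) / n_rounds)))
            for b in range(frames.shape[0]):
                masked = (~known[b]).nonzero(as_tuple=True)[0]
                n_new = min(len(masked), max(1, target - int(known[b].sum())))
                pick = masked[torch.randperm(len(masked))[:n_new]]
                frames[b, pick], known[b, pick] = proposal[b, pick], True
        return frames
```

A hypothetical call with three sparse keyframes (an untrained model, so the output is only structurally meaningful):

```python
B, T, D = 1, 60, 263
keyframes = torch.zeros(B, T, D)
known = torch.zeros(B, T, dtype=torch.bool)
known[:, [0, 30, 59]] = True                 # sparse user-provided keyframes
model = MoMADiffSketch(frame_dim=D).eval()
motion = model.generate(keyframes, known, text_emb=torch.randn(B, 256))
```

The key property this illustrates: because frames are continuous vectors denoised individually, any subset can be pinned by the user, and the masked-prediction loop fills in the rest while always conditioning on what is already fixed.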