ControlMM: Controllable Masked Motion Generation

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
📄 PDF
🤖 AI Summary
Existing motion diffusion models struggle to achieve high-precision spatial control and high-fidelity motion generation simultaneously. To address this, the paper proposes ControlMM, a generative masked motion model that combines masked consistency modeling with inference-time logit editing. The approach integrates masked motion modeling, random masking and reconstruction, calibration of the predicted conditional motion distribution, and parallel, iterative decoding of motion tokens. This enables real-time, highly controllable, high-fidelity human motion synthesis supporting any-joint-any-frame control, body-part timeline editing, and obstacle avoidance. Experiments demonstrate state-of-the-art performance: a Fréchet Inception Distance (FID) of 0.061 (vs. 0.271 for the prior state of the art, roughly 4.4× lower), average control error reduced to 0.0091 (vs. 0.0108), and generation 20× faster than diffusion-based methods.

📝 Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, despite achieving acceptable control precision, these models suffer from generation speed and fidelity limitations. To address these challenges, we propose ControlMM, a novel approach incorporating spatial control signals into the generative masked motion model. ControlMM achieves real-time, high-fidelity, and high-precision controllable motion generation simultaneously. Our approach introduces two key innovations. First, we propose masked consistency modeling, which ensures high-fidelity motion generation via random masking and reconstruction, while minimizing the inconsistency between the input control signals and the extracted control signals from the generated motion. To further enhance control precision, we introduce inference-time logit editing, which manipulates the predicted conditional motion distribution so that the generated motion, sampled from the adjusted distribution, closely adheres to the input control signals. During inference, ControlMM enables parallel and iterative decoding of multiple motion tokens, allowing for high-speed motion generation. Extensive experiments show that, compared to the state of the art, ControlMM delivers superior results in motion quality, with better FID scores (0.061 vs 0.271), and higher control precision (average error 0.0091 vs 0.0108). ControlMM generates motions 20 times faster than diffusion-based methods. Additionally, ControlMM unlocks diverse applications such as any joint any frame control, body part timeline control, and obstacle avoidance. Video visualization can be found at https://exitudio.github.io/ControlMM-page
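The abstract's "parallel and iterative decoding of multiple motion tokens" follows the generative masked modeling paradigm: all tokens start masked, every masked position is predicted in parallel each step, the most confident predictions are kept, and the rest are re-masked on a decreasing schedule. The sketch below illustrates that loop in plain numpy; the dummy predictor, cosine schedule, and argmax selection are illustrative assumptions, not ControlMM's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel for a masked token position

def iterative_masked_decoding(predict_logits, seq_len, vocab_size, num_steps=4):
    """Illustrative parallel iterative decoding of motion tokens.

    predict_logits(tokens) -> (seq_len, vocab_size) logits; masked
    positions carry the MASK sentinel. At each step all masked
    positions are predicted in parallel, the most confident
    predictions are committed, and the remainder are re-masked.
    """
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(num_steps):
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)              # greedy choice per position
        conf = probs.max(-1)
        conf[tokens != MASK] = np.inf        # committed tokens stay committed
        # cosine schedule: fraction of positions still masked after this step
        mask_ratio = np.cos(np.pi / 2 * (step + 1) / num_steps)
        n_mask = int(np.floor(mask_ratio * seq_len))
        new_tokens = pred.copy()
        new_tokens[tokens != MASK] = tokens[tokens != MASK]
        new_tokens[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
        tokens = new_tokens
    return tokens
```

Because every masked position is predicted in parallel, the whole sequence is produced in `num_steps` network calls rather than one call per token, which is the source of the 20× speedup over step-by-step diffusion sampling.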
Problem

Research questions and friction points this paper is trying to address.

Achieving high-precision spatio-temporal motion control
Maintaining high-fidelity motion generation
Enabling diverse joint- and frame-level control applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Logits Regularizer aligns the motion token distribution
Logit Optimization precisely reshapes the token distribution
Differentiable Expectation Sampling enables gradient flow
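The last two bullets fit together: hard token sampling (argmax) is non-differentiable, so a control loss cannot reach the logits. Taking the *expectation* of codebook embeddings under the softmax instead makes the output differentiable in the logits, which can then be nudged by gradient descent at inference time until the decoded motion matches the control signal. The sketch below is a minimal numpy illustration of that idea under simplifying assumptions (a linear "decoder", a squared-error control loss, and an analytically applied softmax Jacobian); it is not ControlMM's actual optimizer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def edit_logits(logits, codebook, target, steps=300, lr=0.5):
    """Illustrative inference-time logit editing via expectation sampling.

    The motion feature is the expectation of codebook entries under
    softmax(logits) rather than a hard argmax, so the control loss
    ||m - target||^2 is differentiable w.r.t. the logits, which are
    adjusted by gradient descent toward the control signal.
    """
    z = logits.copy()
    for _ in range(steps):
        p = softmax(z)
        m = p @ codebook                       # expected embedding (differentiable)
        g = codebook @ (2.0 * (m - target))    # dL/dm pulled back through codebook
        z -= lr * p * (g - p @ g)              # softmax Jacobian applied analytically
    return z
```

After optimization, sampling from the edited distribution favors tokens whose embeddings decode close to the target, which is how control precision is tightened without retraining the model.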
Ekkasit Pinyoanuntapong
University of North Carolina at Charlotte
Muhammad Usama Saleem
University of North Carolina at Charlotte
Korrawe Karunratanakul
Postdoc, ETH Zurich
Pu Wang
University of North Carolina at Charlotte
Hongfei Xue
University of North Carolina at Charlotte
Chen Chen
University of Central Florida
Chuan Guo
Snap Inc.
Junli Cao
Snap Inc.
Jian Ren
Snap Inc.
S. Tulyakov
Snap Inc.