🤖 AI Summary
To address the slow inference and limited real-time applicability of existing methods for spatiotemporally controllable human motion generation, this paper proposes the motion latent consistency model (MotionLCM), built on a motion latent diffusion model, together with a motion ControlNet that operates in MotionLCM's latent space. Explicit control signals (initial motions) in the original motion space provide additional supervision during training. One-step or few-step sampling makes generation highly efficient. Experiments show that MotionLCM synthesizes high-quality motion while following both the text prompt and the initial-motion control signal. It runs in real time, with substantially lower inference latency than existing diffusion-based methods, enabling controllable motion generation driven jointly by text and initial motion.
📝 Abstract
This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial-temporal control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building on the motion latent diffusion model. By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (i.e., initial motions) in the vanilla motion space to further provide supervision for the training process. By employing these techniques, our approach can generate human motions with text and control signals in real time. Experimental results demonstrate the remarkable generation and control capabilities of MotionLCM while maintaining real-time runtime efficiency.
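To make the one-step (or few-step) inference idea concrete, the sketch below shows a generic few-step consistency sampling loop in a latent space. It is an illustration under stated assumptions, not MotionLCM's implementation: `few_step_sample`, `toy_consistency_fn`, the `alpha_bars` schedule, and the latent shape are all hypothetical, and the toy function stands in for a trained network that would also take the text prompt and control signal as conditions.

```python
import numpy as np

def few_step_sample(consistency_fn, shape, alpha_bars, rng):
    """Generic few-step consistency sampling (illustrative sketch).

    alpha_bars holds one cumulative signal level per sampling step, ordered
    from noisiest to cleanest. The first entry is not used numerically,
    because sampling starts from pure Gaussian noise.
    """
    z = rng.standard_normal(shape)      # start from pure noise
    z0 = consistency_fn(z, 0)           # one call already yields a clean estimate
    for i in range(1, len(alpha_bars)):
        # Re-noise the clean estimate to a lower noise level, then map it
        # back to a clean latent with another consistency-function call.
        noise = rng.standard_normal(shape)
        z = np.sqrt(alpha_bars[i]) * z0 + np.sqrt(1.0 - alpha_bars[i]) * noise
        z0 = consistency_fn(z, i)
    return z0

# Hypothetical stand-in for a trained model: for a data distribution that is
# a single point, the exact consistency function maps every noisy input to it.
target = np.linspace(-1.0, 1.0, 8).reshape(2, 4)
calls = []

def toy_consistency_fn(z, step):
    calls.append(step)                  # record each network evaluation
    return target.copy()

rng = np.random.default_rng(0)
out_one = few_step_sample(toy_consistency_fn, target.shape, [0.0], rng)
out_four = few_step_sample(toy_consistency_fn, target.shape, [0.0, 0.3, 0.6, 0.9], rng)
```

The number of network evaluations equals the number of sampling steps, so one-step sampling costs a single forward pass. That is the source of the runtime gain over diffusion samplers that need many denoising iterations.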