🤖 AI Summary
Existing approaches struggle to achieve both real-time streaming motion generation and high-quality offline synthesis, and they lack the ability to anticipate future actions. To address this, this work proposes ActionPlan, a unified motion diffusion framework that leverages frame-level textual latent variables as semantic anchors to guide per-frame action planning. The framework introduces dedicated per-latent diffusion timesteps and an independent denoising mechanism, enabling flexible sampling orders at inference. Within a single model, ActionPlan supports history-conditioned, future-aware real-time streaming generation alongside high-fidelity offline synthesis, and further enables zero-shot motion editing and interpolation. Experiments demonstrate a 5.25× speedup in streaming generation and an 18% FID improvement over the current state of the art.
📝 Abstract
We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming and high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion timesteps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming generation is 5.25× faster than the best previous method while also improving motion quality by 18% in terms of FID.
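The key enabler of both modes is that each motion latent carries its own diffusion timestep, so the sampling order becomes a design choice rather than a fixed global schedule. The sketch below illustrates this idea under simple assumptions (the function names and the clamped diagonal schedule are illustrative, not taken from the paper): an offline schedule denoises all frames in lockstep, while a streaming-style schedule staggers the timesteps so earlier frames reach full denoising first and can be emitted while later frames are still noisy.

```python
# Illustrative sketch: per-frame (latent-specific) diffusion timestep schedules.
# Each row is one denoising iteration; entry [it][f] is frame f's timestep.
# Names and scheduling details are assumptions for illustration only.

def make_offline_schedule(num_frames: int, num_steps: int) -> list[list[int]]:
    """Offline mode: every frame shares the same timestep at each iteration."""
    return [[t] * num_frames for t in range(num_steps - 1, -1, -1)]

def make_streaming_schedule(num_frames: int, num_steps: int) -> list[list[int]]:
    """Streaming-style mode: a staggered (diagonal) schedule where frame f
    begins descending at iteration f, so earlier frames finish denoising
    first and can be streamed out while later frames remain noisy."""
    schedule = []
    total_iters = num_steps + num_frames - 1
    for it in range(total_iters):
        row = []
        for f in range(num_frames):
            t = num_steps - 1 - (it - f)          # frame f lags frame 0 by f iterations
            row.append(min(max(t, 0), num_steps - 1))  # clamp into [0, num_steps-1]
        schedule.append(row)
    return schedule
```

For 3 frames and 4 timesteps, the streaming schedule passes through rows like `[3, 3, 3] → [2, 3, 3] → [1, 2, 3] → [0, 1, 2] → …`, so frame 0 is fully denoised (timestep 0) while frame 2 is still mid-denoising. The same flexibility is what would allow fixing some latents (e.g. observed history or edited frames) at timestep 0 while denoising the rest, which is how zero-shot editing and in-betweening can fall out of the same mechanism.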