AI Summary
This work addresses the challenges of long-horizon historical motion modeling, continuous text input, and strict real-time requirements in text-conditioned streaming human pose generation. We propose a diffusion-enhanced autoregressive modeling framework operating in a continuous causal latent space. Unlike diffusion models constrained by fixed-length horizons and GPT-style approaches that suffer from latency and error accumulation due to discrete, non-causal tokenization, our method introduces a learnable motion encoder and a streaming decoder that jointly integrate diffusion priors with causal autoregressive dynamics in a continuous latent space, enabling low-latency (millisecond-level), multi-turn interactive generation. Experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, supporting stable generation of over 1,000 frames and dynamic cross-semantic instruction editing.
Abstract
This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discrete, non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while enabling more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
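To make the streaming pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the causal autoregressive loop the abstract describes: each step predicts the next continuous latent from the causal history and the text condition, and a streaming decoder emits a pose immediately instead of waiting for the full sequence. All names, dimensions, and the linear/tanh placeholder networks are hypothetical stand-ins for the learned encoder, autoregressive model, and decoder.

```python
import numpy as np

LATENT_DIM = 8   # hypothetical size of a continuous motion latent
POSE_DIM = 6     # hypothetical size of a per-frame pose vector

rng = np.random.default_rng(0)
# Placeholder weights standing in for the learned AR model and streaming decoder.
W_ar = rng.standard_normal((LATENT_DIM * 2, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, POSE_DIM)) * 0.1

def predict_next_latent(history, text_emb):
    """Causal prediction: conditions only on past latents and the text embedding."""
    prev = history[-1] if history else np.zeros(LATENT_DIM)
    x = np.concatenate([prev, text_emb])
    return np.tanh(x @ W_ar)  # stands in for the probabilistic AR head

def decode_pose(latent):
    """Streaming decoder: each latent is decoded to a pose online, one step at a time."""
    return latent @ W_dec

def stream_generate(text_emb, n_frames):
    history, poses = [], []
    for _ in range(n_frames):
        z = predict_next_latent(history, text_emb)
        history.append(z)              # growing causal context
        poses.append(decode_pose(z))   # emitted immediately; no fixed-length horizon
    return np.stack(poses)

poses = stream_generate(rng.standard_normal(LATENT_DIM), n_frames=16)
print(poses.shape)  # (16, 6)
```

Because poses are decoded per step from a causal context of arbitrary length, the same loop supports long-term generation and mid-stream text changes (swap `text_emb` between steps) without restarting the sequence.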