PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

📅 2026-03-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing text-to-motion methods typically compress each frame into a single holistic latent vector that entangles trajectory and joint rotations, and they usually need separate models for different tasks, with autoregressive variants accumulating error over long sequences. To address this, the work proposes PRISM, a framework that assigns each body joint its own latent token, forming a structured spatiotemporal (time × joints) latent space. PRISM gives every token its own timestep embedding, so conditioning frames can be injected as clean tokens; this provides unified support for text-driven motion, pose-conditioned generation, and streaming motion synthesis. Combining a causal variational autoencoder with forward-kinematics supervision, token-level denoising, and self-forcing training to suppress drift in long rollouts, PRISM reports state-of-the-art results on HumanML3D, MotionHub, and BABEL, as well as in a 50-scenario user study.

📝 Abstract
Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
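
The noise-free condition injection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`make_alpha_bar`, `noise_with_conditions`), the linear noise schedule, the grid sizes, and the DDPM-style forward noising are all assumptions chosen for illustration. The sketch only shows the core idea that each token in the (time × joints) grid has its own timestep, and that conditioning frames are held at timestep 0 so they remain clean.

```python
import math
import random

def make_alpha_bar(num_steps):
    """Cumulative signal level per timestep (assumed linear beta schedule).
    Index 0 is defined as noise-free, i.e. alpha_bar[0] == 1.0."""
    alpha_bar, running = [1.0], 1.0
    for i in range(num_steps):
        beta = 1e-4 + (2e-2 - 1e-4) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def noise_with_conditions(latents, cond_frames, t, alpha_bar, rng):
    """latents[frame][joint] is a feature vector (a time x joints grid).
    Returns the noised grid and a per-token timestep grid."""
    cond = set(cond_frames)
    noised, timesteps = [], []
    for f, frame in enumerate(latents):
        t_tok = 0 if f in cond else t  # conditioning frames stay at timestep 0
        a = alpha_bar[t_tok]
        noised.append([
            [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in joint]
            for joint in frame
        ])
        timesteps.append([t_tok] * len(frame))  # one timestep per joint token
    return noised, timesteps

# Hypothetical sizes: 8 frames, 4 joints, 16 latent channels per token.
rng = random.Random(0)
latents = [[[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)] for _ in range(8)]
alpha_bar = make_alpha_bar(1000)
noised, timesteps = noise_with_conditions(latents, [0, 1], 600, alpha_bar, rng)
```

In this sketch the first two frames pass through unchanged while the rest are noised at step 600. A denoiser that receives both `noised` and `timesteps` could, in principle, treat the clean tokens as context; chaining segments for streaming would then amount to reusing the last generated frames as the next segment's conditioning frames.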
Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation
motion autoencoding
latent representation
autoregressive generation
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint-factorized latent space
causal VAE
noise-free condition injection
autoregressive streaming synthesis
motion generation foundation model