PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-motion generation methods often rely on a single holistic latent vector, which couples trajectory and joint rotations, limiting multi-task support and leading to error accumulation in long sequences. To address this, this work proposes PRISM, a novel framework that decouples human joints into independent latent tokens, forming a structured spatiotemporal latent space. PRISM introduces a token-level conditioning mechanism with timestep embeddings, enabling unified support for text-driven motion, pose-conditioned generation, and streaming motion synthesis. By integrating a causal variational autoencoder, forward kinematics supervision, and a denoising diffusion strategy, PRISM achieves state-of-the-art performance across multiple benchmarks—including HumanML3D, MotionHub, and BABEL—and demonstrates superior robustness in a 50-scenario user study, significantly mitigating motion drift in long-horizon generation.

📝 Abstract
Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
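The noise-free condition injection described in the abstract can be illustrated with a small sketch. Everything below is a hypothetical reconstruction from the abstract alone, not the authors' code: the latent grid shape `(T, J, D)`, the linear noising schedule, and the function name `inject_conditions` are all assumptions for illustration. The key idea it shows is that every token in the time × joints grid carries its own timestep, so conditioning frames can be passed through clean at timestep 0 while the remaining tokens are noised and denoised.

```python
import numpy as np

def inject_conditions(latents, cond_mask, t, rng):
    """Hedged sketch of per-token timestep conditioning (not the paper's code).

    latents:   (T, J, D) joint-factorized latent grid (time x joints x dim)
    cond_mask: (T,) bool, True where a frame is a conditioning frame
    t:         diffusion timestep for the tokens being denoised

    Returns the partially-noised grid and a per-token timestep map:
    conditioning tokens keep timestep 0 (clean), all others carry t.
    """
    T, J, _ = latents.shape
    timesteps = np.full((T, J), t, dtype=np.int64)
    timesteps[cond_mask] = 0  # clean tokens: no noise applied

    # Toy linear schedule, purely for illustration.
    alpha = 1.0 - t / 1000.0
    noise = rng.standard_normal(latents.shape)
    noised = np.sqrt(alpha) * latents + np.sqrt(1.0 - alpha) * noise

    # Conditioning frames are injected unchanged (noise-free).
    noised[cond_mask] = latents[cond_mask]
    return noised, timesteps
```

In streaming synthesis, the same mechanism would let the last frames of a generated segment act as clean conditioning tokens for the next segment, chaining segments autoregressively without a separate inpainting model.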
Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation
motion autoencoding
latent representation
autoregressive generation
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint-factorized latent space
causal VAE
noise-free condition injection
autoregressive streaming synthesis
motion generation foundation model
Zeyu Ling
Zhejiang University
Computer Vision
Qing Shuai
Tencent
Computer Vision
Teng Zhang
Schrödinger, University of Notre Dame, University of Science and Technology of China
Heat Transfer, Nanoscale, Polymer
Shiyang Li
Amazon
Machine Learning, Natural Language Processing, Time Series Modeling
Bo Han
Computer Animation & Perception Group, Zhejiang University, Hangzhou, China
Changqing Zou
State Key Laboratory of CAD & CG, Zhejiang University, Hangzhou, China; Zhejiang Lab, Hangzhou, China