MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously achieving high fidelity, low latency, streaming generation, and scalability in text-driven 3D human motion synthesis. It proposes a joint framework comprising MoSA-VQ (motion scale-adaptive residual vector quantization) and the RQHC-Transformer (residual quantized hierarchical causal transformer). MoSA-VQ enables a compact, multi-granularity motion representation via scale-adaptive residual quantization, while the RQHC-Transformer supports single-step, multi-layer motion token generation and streaming autoregressive decoding. A novel text-motion cross-modal alignment mechanism is introduced to enhance semantic consistency. Evaluated on the HumanML3D, KIT-ML, and CMP benchmarks, the method achieves state-of-the-art generation quality, measured by diversity, realism, and text-motion alignment, while significantly reducing inference latency. Notably, it is the first approach to enable real-time, high-fidelity, zero-shot streaming motion generation within a single forward pass.
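For a concrete picture of the residual-quantization idea behind MoSA-VQ, the minimal sketch below quantizes a continuous motion feature sequence level by level, applying a learnable scale at each residual level before codebook lookup. All names and hyperparameters (`ResidualScaleVQ`, `num_levels`, `codebook_size`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of residual vector quantization with a learnable scale per
# level (an approximation of the MoSA-VQ idea; details are assumed, not taken
# from the paper). Training would additionally need a straight-through
# estimator and commitment losses, omitted here for brevity.
import torch
import torch.nn as nn


class ResidualScaleVQ(nn.Module):
    def __init__(self, dim: int, num_levels: int = 4, codebook_size: int = 512):
        super().__init__()
        # One codebook per residual level.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )
        # Learnable per-level scale applied to the residual before lookup.
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, x: torch.Tensor):
        """x: (batch, time, dim) continuous motion features."""
        residual, quantized, codes = x, torch.zeros_like(x), []
        for level, codebook in enumerate(self.codebooks):
            scaled = residual * self.scales[level]
            # Nearest-neighbour lookup in this level's codebook.
            dists = torch.cdist(scaled.reshape(-1, scaled.size(-1)), codebook.weight)
            idx = dists.argmin(dim=-1).reshape(x.shape[:-1])
            q = codebook(idx) / self.scales[level]
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        # codes[k] holds the (batch, time) token map of residual level k.
        return quantized, torch.stack(codes, dim=1)
```

Each level refines what the previous levels reconstructed, which is what yields the coarse-to-fine, multi-granularity token representation described above.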

📝 Abstract
Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
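As a rough, non-authoritative illustration of how multi-layer tokens can be emitted in a single forward pass, the sketch below runs a causal transformer over previously generated motion tokens (prefixed with a text embedding) and reads out one classification head per residual level. The class name `MultiLevelCausalDecoder`, the head layout, and all hyperparameters are assumptions for illustration, not the actual RQHC-Transformer.

```python
# Hedged sketch: a causal transformer that predicts tokens for every residual
# quantization level from one forward pass (illustrative assumption, not the
# paper's exact architecture).
import torch
import torch.nn as nn


class MultiLevelCausalDecoder(nn.Module):
    def __init__(self, dim: int = 512, num_levels: int = 4,
                 codebook_size: int = 512, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # One classification head per residual level; all heads are evaluated
        # in the same forward pass, so each timestep needs only one pass.
        self.heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_levels))

    def forward(self, token_embeddings: torch.Tensor, text_embedding: torch.Tensor):
        """token_embeddings: (batch, time, dim); text_embedding: (batch, dim)."""
        # Prepend the text condition and apply a causal mask for streaming decoding.
        h = torch.cat([text_embedding.unsqueeze(1), token_embeddings], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1)).to(h.device)
        h = self.backbone(h, mask=mask)
        # Logits per level: (num_levels, batch, time + 1, codebook_size).
        return torch.stack([head(h) for head in self.heads])
```

In a streaming setting, one such step per frame yields the codes for every residual level at once, which is where the latency savings over layer-by-layer autoregression come from.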
Problem

Research questions and friction points that this paper addresses.

Achieving high-fidelity, real-time 3D motion generation
Reducing inference latency in autoregressive motion synthesis
Enhancing semantic fidelity under textual control
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoSA-VQ for hierarchical motion discretization
RQHC-Transformer for single-pass token generation
Text condition alignment for improved semantic fidelity (a possible realization is sketched below)
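One common way to realize a text-motion alignment objective of this kind is a symmetric contrastive (InfoNCE-style) loss over paired text and motion embeddings, sketched below. This is an illustrative assumption, not necessarily the mechanism used in MOGO.

```python
# Hedged sketch of a symmetric contrastive (InfoNCE-style) text-motion
# alignment loss; an illustrative stand-in, not necessarily MOGO's mechanism.
import torch
import torch.nn.functional as F


def text_motion_alignment_loss(text_emb: torch.Tensor,
                               motion_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """text_emb, motion_emb: (batch, dim) embeddings of paired text and motion."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; penalize both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```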