AI Summary
Existing text-to-motion generation models predominantly adopt the pelvis-relative, inter-frame-differenced local representation popularized by HumanML3D, which benefited early GAN training but is suboptimal for diffusion models and restrictive for downstream motion editing. This work proposes a paradigm shift toward regressing global 3D absolute joint coordinates, eliminating kinematic-constraint losses and classifier-free guidance. A pure Transformer architecture directly regresses absolute coordinates in world space. To our knowledge, this is the first systematic demonstration that absolute representations are superior for diffusion-based motion generation: on HumanML3D, the approach achieves significant improvements in FID (−12.3%), R-Precision (+8.7%), and MM-Dist (−15.1%). It enables zero-shot spatiotemporal editing and fine-grained, text-driven motion refinement, and is the first to realize end-to-end, high-fidelity SMPL-H mesh animation generation from text.
Abstract
State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplified training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified, long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without task-specific re-engineering or costly classifier-guidance generation from control signals. Finally, we demonstrate promising generalization by directly generating SMPL-H mesh vertex motion from text, laying a strong foundation for future research and motion-related applications.
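To make the representational contrast concrete, the sketch below shows a simplified, hypothetical version of the two encodings the abstract compares: a HumanML3D-style local representation (pelvis-relative joint positions plus inter-frame root displacements) versus the absolute formulation, which stores global joint coordinates as-is. The function names, the `(T, J, 3)` layout, and the assumption that joint 0 is the pelvis are illustrative conventions, not the paper's actual data format.

```python
import numpy as np

def to_local_relative(joints):
    """Simplified sketch of a HumanML3D-style local representation.

    joints: (T, J, 3) global joint positions; joint 0 is assumed to be
    the pelvis/root. Returns pelvis-relative joint positions and the
    inter-frame root displacements (the frame-differenced part).
    """
    root = joints[:, 0:1, :]                    # (T, 1, 3) pelvis trajectory
    local = joints - root                       # pelvis-relative positions
    root_vel = np.diff(root[:, 0, :], axis=0)   # (T-1, 3) root deltas
    return local, root_vel

def from_local_relative(local, root_vel, root_init):
    """Invert the local encoding: integrate root deltas, then re-add the root."""
    root = np.concatenate(
        [root_init[None], root_init[None] + np.cumsum(root_vel, axis=0)], axis=0
    )                                           # (T, 3) recovered trajectory
    return local + root[:, None, :]

def to_absolute(joints):
    """The formulation advocated here is effectively the identity map:
    absolute global coordinates need no root-relative re-encoding, so
    editing a joint's trajectory edits the motion directly."""
    return joints
```

The round trip through `to_local_relative`/`from_local_relative` illustrates the redundancy the abstract mentions: the root trajectory must be re-integrated from per-frame deltas before any global-space edit, whereas the absolute form is directly editable.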