Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between motion diversity and semantic fidelity in text-to-3D human motion generation, this paper proposes a generative framework based on stochastic sampling in the latent space. The core innovation lies in explicitly modeling Gaussian noise as a diversity carrier and introducing a learnable random sampling mechanism within the continuous text latent space, jointly optimized with motion representation via a Transformer architecture. Unlike methods relying on discrete latent variables or post-hoc augmentation, our approach intrinsically models motion stochasticity at the generation source. Evaluated on HumanML3D and KIT-ML benchmarks, the method achieves state-of-the-art text-motion alignment (R-Precision) while significantly improving diversity—reducing FID by 12.3% and increasing MultiModality by 18.7%.

Technology Category

Application Category

📝 Abstract
Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, extit{i.e.}, Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating a explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets~(HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
Problem

Research questions and friction points this paper is trying to address.

Generating diverse 3D human motions from text
Ensuring text-motion semantic consistency during generation
Overcoming limited diversity in existing text-to-motion methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces uncertainty for diverse motion generation
Utilizes noise signals in transformer-based methods
Constructs continuous latent space with stochastic sampling
🔎 Similar Papers
No similar papers found.
Z
Zheng Qin
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Yabing Wang
Yabing Wang
Xi’an Jiaotong University
multimodal learning
Minghui Yang
Minghui Yang
Ant Group
NLPDialogueGraph3DV
Sanping Zhou
Sanping Zhou
Xi'an Jiaotong University
Computer VisionMachine Learning
M
Ming Yang
Ant Group, Hangzhou, Zhejiang 310000, China
L
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China