Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

To address the trade-off between motion diversity and semantic fidelity in text-to-3D human motion generation, this paper proposes a generative framework based on stochastic sampling in the latent space. The core innovation lies in explicitly modeling Gaussian noise as a diversity carrier and introducing a learnable random sampling mechanism within the continuous text latent space, jointly optimized with motion representation via a Transformer architecture. Unlike methods relying on discrete latent variables or post-hoc augmentation, our approach intrinsically models motion stochasticity at the generation source. Evaluated on HumanML3D and KIT-ML benchmarks, the method achieves state-of-the-art text-motion alignment (R-Precision) while significantly improving diversity—reducing FID by 12.3% and increasing MultiModality by 18.7%.

Technology Category

Application Category

📝 Abstract

Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, extit{i.e.}, Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating a explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets~(HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.

Problem

Research questions and friction points this paper is trying to address.

Generating diverse 3D human motions from text

Ensuring text-motion semantic consistency during generation

Overcoming limited diversity in existing text-to-motion methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces uncertainty for diverse motion generation

Utilizes noise signals in transformer-based methods

Constructs continuous latent space with stochastic sampling

🔎 Similar Papers

HUMOS: Human Motion Model Conditioned on Body Shape