RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

Date: 2026-05-25
AI Summary
Existing methods for 3D human motion generation are hindered by the absence of large-scale, high-quality in-the-wild motion datasets: available data are either small but high-fidelity, or large but low-quality and highly redundant. To address this limitation, this work introduces RoMo, the first large-scale in-the-wild human motion dataset that integrates semantic categorization with rigorous quality filtering. A semantic-aware filtering pipeline removes low-quality sequences, while a three-level semantic taxonomy enables fine-grained annotation and evaluation. The accompanying Motion Toolbox provides automated annotation tools, standardized evaluation metrics, and visualization utilities to support reproducible research. Generative models trained on RoMo achieve state-of-the-art performance in motion fidelity, diversity, and comprehension of complex textual prompts.
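To make the idea of filtering out static sequences concrete, the following is a minimal sketch and not the RoMo pipeline itself. It assumes a sequence is a list of frames, each frame a list of (x, y, z) joint positions, and that "static" can be approximated by a low mean joint speed. The function names, the threshold values, and the category names in `CATEGORY_THRESHOLDS` are all hypothetical; the per-category lookup only illustrates how a filter could be made taxonomy-aware.

```python
from math import sqrt
from typing import Dict, List, Sequence

# One frame is a list of joints; each joint is an (x, y, z) position.
Frame = Sequence[Sequence[float]]

# Hypothetical per-category speed thresholds (units per second).
# Categories that are naturally calm get a lower bar so they are not discarded.
CATEGORY_THRESHOLDS: Dict[str, float] = {"sitting": 0.01, "locomotion": 0.10}
DEFAULT_THRESHOLD = 0.05  # assumed fallback, not a value from the paper


def mean_joint_speed(frames: List[Frame], fps: float) -> float:
    """Average per-joint speed over consecutive frame pairs."""
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += sqrt(sum((ci - pi) ** 2 for pi, ci in zip(p, c)))
            count += 1
    if count == 0:  # fewer than two frames, or frames without joints
        return 0.0
    return total / count * fps


def is_static(frames: List[Frame], category: str, fps: float = 30.0) -> bool:
    """Flag a sequence whose mean joint speed is below its category threshold."""
    threshold = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    return mean_joint_speed(frames, fps) < threshold


def filter_sequences(dataset, fps: float = 30.0):
    """Keep only (category, frames) pairs that are not flagged as static."""
    return [(cat, fr) for cat, fr in dataset if not is_static(fr, cat, fps)]
```

A real pipeline would also need artifact checks (jitter, foot sliding, implausible poses), which this sketch leaves out.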
Abstract
Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves this tradeoff. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.
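The claim that per-category evaluation reveals weaknesses hidden by global metrics can be illustrated with a small sketch. It is an assumption-laden example and not the Motion Toolbox API: each sample is taken to carry a three-level taxonomy path and a scalar score where higher is better, and the category names and the function `per_category_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# A three-level taxonomy path, e.g. (top level, mid level, leaf).
TaxonomyPath = Tuple[str, str, str]


def per_category_means(
    samples: Iterable[Tuple[TaxonomyPath, float]], level: int
) -> Dict[Tuple[str, ...], float]:
    """Average the scores of samples grouped by the first `level` taxonomy levels."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    buckets: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for path, score in samples:
        buckets[path[:level]].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Hypothetical scores for three generated motions.
samples = [
    (("locomotion", "walk", "forward"), 0.9),
    (("locomotion", "run", "sprint"), 0.7),
    (("sports", "ball", "throw"), 0.2),
]

global_score = mean(score for _, score in samples)
top_level = per_category_means(samples, level=1)
```

In this toy data the global mean sits near 0.6, while the top-level breakdown separates a strong "locomotion" group from a weak "sports" group, which is the kind of gap a single aggregate number conceals.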
Problem

Research questions and friction points this paper is trying to address.

human motion generation
large-scale dataset
in-the-wild data
motion quality
semantic taxonomy
Innovation

Methods, ideas, or system contributions that make the work stand out.

human motion generation
semantic taxonomy
large-scale dataset
taxonomy-aware filtering
Motion Toolbox