🤖 AI Summary
Existing text-to-3D human motion generation methods enforce alignment only at the sequence level, neglecting intra-modal semantic structure. This work proposes a **segment-level alignment paradigm**, decomposing both text and motion into semantically coherent temporal segments and establishing cross-modal segment-to-segment correspondences. The method comprises three core components: text segmentation and feature extraction, motion segmentation and boundary detection, and segment-level contrastive alignment—unified within a shared embedding space to enable fine-grained cross-modal retrieval. Evaluated on HumanML3D, our approach achieves a TOP-1 retrieval accuracy of 0.553, substantially outperforming sequence-level baselines. Moreover, it supports motion localization and bidirectional text–motion retrieval. To our knowledge, this is the first work to empirically validate the effectiveness and generalizability of segment-level alignment for embodied generative tasks.
📝 Abstract
Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.