Temporal Consistency-Aware Text-to-Motion Generation

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-motion generation methods often produce semantically misaligned or physically implausible motions because they do not explicitly model cross-sequence temporal consistency. To address this limitation, this work proposes TCA-T2M, a framework that explicitly enforces temporal consistency across motion sequences. The approach employs a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) to align motion sequences, a masked motion Transformer for text-conditioned generation, and a kinematic constraint module that improves physical plausibility. On the HumanML3D and KIT-ML benchmarks, the method achieves state-of-the-art performance, significantly improving the semantic alignment, temporal coherence, and physical realism of the generated motions.
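The masked motion Transformer mentioned above generates discrete motion tokens by iteratively unmasking a fully masked sequence, committing the most confident predictions at each step (a MaskGIT-style decoding scheme; the paper's exact schedule is not given here, and all names below are illustrative, not from the paper):

```python
import numpy as np

def iterative_unmask(seq_len, predict_fn, steps=4, mask_id=-1):
    """Confidence-based iterative decoding over discrete motion tokens.

    predict_fn(tokens) -> (seq_len, n_codes) per-position probabilities,
    standing in for a text-conditioned masked transformer.
    Starts fully masked and commits the most confident positions each step.
    """
    tokens = np.full(seq_len, mask_id)
    for step in range(steps):
        probs = predict_fn(tokens)
        conf = probs.max(axis=1).astype(float)
        conf[tokens != mask_id] = -np.inf      # skip committed positions
        remaining = int((tokens == mask_id).sum())
        n_commit = int(np.ceil(remaining / (steps - step)))
        for pos in np.argsort(-conf)[:n_commit]:
            tokens[pos] = probs[pos].argmax()  # commit most confident tokens
    return tokens

# Dummy predictor: position pos always favors code (pos % 5).
def dummy_predict(tokens):
    probs = np.full((8, 5), 0.1)
    for pos in range(8):
        probs[pos, pos % 5] = 0.6
    return probs

out = iterative_unmask(8, dummy_predict)
print(out.tolist())  # [0, 1, 2, 3, 4, 0, 1, 2] — no masked positions remain
```

With 4 steps over 8 positions, the linear schedule here commits 2 tokens per step; real schedules (e.g. cosine) and a learned, text-conditioned predictor would replace the toy pieces.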

📝 Abstract
Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
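The discrete motion representation the abstract refers to rests on vector quantization: each continuous encoder latent is snapped to its nearest codebook entry, turning a motion sequence into a sequence of discrete tokens. A minimal numpy sketch of that lookup step (shapes and names are illustrative, not the paper's):

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbor codebook lookup, the discretization step in a VQ-VAE.

    latents:  (T, D) encoder outputs, one latent per motion frame/segment.
    codebook: (K, D) learned code vectors.
    Returns discrete code indices (T,) and the quantized latents (T, D).
    """
    # Squared Euclidean distance from every latent to every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)          # the discrete motion "tokens"
    return indices, codebook[indices]

# Toy example: latents generated near codes 0, 2, 2, 1 are mapped back to them.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 8))
latents = codebook[[0, 2, 2, 1]] + 0.01 * rng.normal(size=(4, 8))
idx, quantized = quantize(latents, codebook)
print(idx.tolist())  # [0, 2, 2, 1]
```

In training, the quantized latents feed the decoder while a straight-through estimator carries gradients past the non-differentiable argmin; the cross-sequence temporal alignment that TCaS-VQ-VAE adds on top is not modeled in this sketch.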
Problem

Research questions and friction points this paper is trying to address.

Temporal Consistency
Text-to-Motion Generation
Cross-sequence Alignment
Motion Synthesis
Semantic Misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Consistency
Text-to-Motion Generation
VQ-VAE
Masked Motion Transformer
Kinematic Constraints