T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-driven 3D human motion generation methods often overlook the coupling between motion periodicity and keyframe saliency, and are sensitive to semantically equivalent textual rephrasings, leading to drift and instability in long-sequence generation. To address these limitations, this work proposes a periodicity- and saliency-aware Mamba architecture that integrates enhanced density peak clustering to estimate keyframe weights and FFT-accelerated autocorrelation for periodicity analysis. Furthermore, we introduce the Periodicity-Differential Cross-Modal Alignment Module (PDCAM), which explicitly models the coupling mechanism between periodicity and saliency for the first time, thereby enhancing the robustness of text-motion embedding alignment. Extensive experiments on HumanML3D and KIT-ML demonstrate state-of-the-art performance, achieving an FID of 0.068 and significantly outperforming existing approaches across all metrics.
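The summary's keyframe weighting builds on Density Peaks Clustering (Rodriguez & Laio, 2014). As a rough sketch of the standard DPC scoring, not the paper's enhanced variant, each frame can be weighted by local density ρ times the distance δ to the nearest denser frame; the function name, the cutoff `d_c`, and the plain ρ·δ weighting are illustrative assumptions:

```python
import numpy as np

def keyframe_weights_dpc(frames, d_c=1.0):
    """Generic Density Peaks Clustering saliency sketch (hypothetical helper):
    weight_i = rho_i * delta_i, where rho_i counts frames within cutoff d_c
    and delta_i is the distance to the nearest frame with higher density."""
    X = np.asarray(frames, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between pose frames.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (dist < d_c).sum(axis=1) - 1  # local density, excluding self
    delta = np.empty(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        # Densest frames get the maximum distance, marking them as peaks.
        delta[i] = dist[i, denser].min() if denser.size else dist[i].max()
    w = rho * delta
    return w / w.sum() if w.sum() > 0 else np.full(n, 1.0 / n)
```

Frames that are both locally dense and far from any denser frame (cluster centers, i.e. candidate keyframes) receive the largest weights; isolated low-density frames receive near-zero weight.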

📝 Abstract
Text-to-motion generation, which converts natural-language motion descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Although existing models achieve high fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; and (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagate through the decoder, and produce unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) introducing a Periodicity-Saliency Aware Mamba, which uses novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) that makes the alignment of textual and motion embeddings robust. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
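FFT-accelerated autocorrelation generally relies on the Wiener-Khinchin identity: the autocorrelation of a signal is the inverse FFT of its power spectrum, reducing the cost from O(n²) to O(n log n). The sketch below estimates a dominant period this way for a single 1-D motion channel; the function name, zero-padding choice, and peak-picking heuristic are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def dominant_period_fft_autocorr(signal):
    """Estimate the dominant period of a 1-D motion signal via
    FFT-accelerated autocorrelation (Wiener-Khinchin); hypothetical sketch."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to 2n so the circular correlation equals the linear one.
    spectrum = np.fft.rfft(x, n=2 * n)
    acf = np.fft.irfft(spectrum * np.conj(spectrum))[:n]
    acf /= acf[0]  # normalize so acf[0] == 1
    # Local maxima of the ACF after lag 0 are period candidates;
    # pick the one with the highest correlation.
    peaks = [k for k in range(1, n - 1)
             if acf[k] > acf[k - 1] and acf[k] >= acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None
```

For example, a clean sine wave sampled with period 20 yields a lag-20 estimate; for multi-joint motion one would presumably aggregate such estimates across channels.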
Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation
motion periodicity
keyframe saliency
semantic robustness
generation stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Periodicity-Saliency Coupling
Mamba Architecture
Cross-modal Alignment
Text-to-Motion Generation
FFT-accelerated Autocorrelation
Xingzu Zhan
Shenzhen University, Shenzhen, China
Chen Xie
Politecnico di Torino
Honghang Chen
Shenzhen University, Shenzhen, China
Yixun Lin
Jinan University, Guangzhou, China
Xiaochun Mai
Shenzhen University, Shenzhen, China