HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address coarse spatial modeling, temporal feature redundancy, and loss of joint-level details in text-to-motion generation, this paper proposes a hierarchical spatiotemporal fusion framework. Methodologically, it introduces the first dual-space Mamba architecture—modeling both part-level and global-level motion—complemented by bidirectional temporal Mamba for efficient sequence modeling, and a Dynamic Spatiotemporal Fusion Module (DSFM) to suppress redundancy and enhance complementary feature integration. Crucially, it is the first work to systematically incorporate structured state space models into fine-grained human motion generation, integrating multi-granularity human topology encodings. Evaluated on HumanML3D, our method achieves a new state-of-the-art FID of 0.189 (a 30% reduction), along with superior motion diversity, semantic alignment, and joint-level accuracy. The generated motions exhibit significantly improved naturalness and stronger text-motion consistency.

Technology Category

Application Category

📝 Abstract
Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates ``Part-based + Whole-based'' parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.
Problem

Research questions and friction points this paper is trying to address.

Enhances text-to-motion generation with hierarchical spatiotemporal fusion.
Addresses feature redundancy and overlooked joint-level details in motion modeling.
Improves fidelity and semantic alignment in motion generation applications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Spatial Mamba for joint dynamics modeling
Bi-Temporal Mamba for motion detail encoding
Dynamic Spatiotemporal Fusion for feature refinement
🔎 Similar Papers
No similar papers found.
X
Xingzu Zhan
Shenzhen University, Shenzhen, China
Chen Xie
Chen Xie
Politecnico di Torino
Synthesis of smart sensors
H
Haoran Sun
Shenzhen University, Shenzhen, China
X
Xiaochun Mai
Shenzhen University, Shenzhen, China