WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval

📅 2025-08-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Text-motion retrieval (TMR) suffers from inaccurate semantic alignment because existing encoders model the structurally complex human body only at a coarse spatiotemporal granularity. To address this, we propose WaMo, a wavelet-based multi-frequency feature extraction framework that is the first to integrate trajectory wavelet decomposition with learnable inverse reconstruction for joint-level, multi-resolution modeling of spatiotemporal dynamics. We further introduce a temporal reordering pretraining strategy that uses a shuffle-and-predict task to strengthen temporal-coherence learning without auxiliary annotations. Our method achieves new state-of-the-art performance, improving Rsum by 17.0% on HumanML3D and 18.2% on KIT-ML. Key contributions include: (i) wavelet-driven fine-grained motion representation; (ii) a learnable reconstruction mechanism enabling semantic disentanglement in the frequency domain; and (iii) a time-aware pretraining paradigm requiring no additional supervision.
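For readers unfamiliar with the Rsum figure quoted above, the following is a minimal sketch of how such a score is commonly computed in retrieval work: recall@K is summed over several cutoffs and over both retrieval directions. The cutoffs `ks=(1, 5, 10)`, the two-direction sum, and the function names are assumptions for illustration; the summary does not state which variant the paper uses.

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@k in percent for a square similarity matrix.

    sim[i, j] is the similarity between query i and candidate j;
    the correct candidate for query i is assumed to be j == i.
    """
    correct = np.diag(sim)[:, None]
    # Rank of the correct item = number of candidates scoring strictly higher.
    ranks = (sim > correct).sum(axis=1)
    return 100.0 * np.mean(ranks < k)

def rsum(sim, ks=(1, 5, 10)):
    """Sum of recall@k over both directions (cutoffs are an assumption)."""
    text_to_motion = sum(recall_at_k(sim, k) for k in ks)
    motion_to_text = sum(recall_at_k(sim.T, k) for k in ks)
    return text_to_motion + motion_to_text
```

With three cutoffs and two directions, a perfect retriever would score 600 under this convention, so reported Rsum values are only comparable when the same cutoffs are used.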

📝 Abstract

Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of the human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, which limits precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct the original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve the learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo's superiority: it improves Rsum by 17.0% on HumanML3D and by 18.2% on KIT-ML, outperforming existing state-of-the-art (SOTA) methods.
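To make the decomposition and reconstruction components concrete, here is a minimal sketch of a multi-level wavelet transform applied along the time axis of joint trajectories. It assumes a Haar wavelet and a fixed analytic inverse; the abstract does not name the wavelet family, and WaMo's reconstruction is described as learnable, so this is an illustration of the underlying idea rather than the paper's method. The function names and the `(T, J, 3)` array layout are hypothetical.

```python
import numpy as np

def haar_dwt(traj):
    """One Haar level along time. traj: (T, J, 3) with even T."""
    even, odd = traj[0::2], traj[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency: coarse motion
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency: local changes
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt (fixed, not learned)."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty((2 * approx.shape[0],) + approx.shape[1:], dtype=approx.dtype)
    out[0::2], out[1::2] = even, odd
    return out

def multilevel_haar(traj, levels):
    """Return [detail_1, ..., detail_L, approx_L], finest band first."""
    bands, current = [], traj
    for _ in range(levels):
        current, detail = haar_dwt(current)
        bands.append(detail)
    bands.append(current)
    return bands

# Hypothetical input: 8 frames, 22 joints, xyz coordinates.
trajectory = np.random.default_rng(0).normal(size=(8, 22, 3))
approx, detail = haar_dwt(trajectory)
reconstructed = haar_idwt(approx, detail)
```

Because each joint is transformed independently along time, every band keeps the joint axis intact, which is what allows part-specific features to be read off at each resolution.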

Problem

Research questions and friction points this paper is trying to address.

Matching 3D motions with text is challenging due to complex body dynamics.

Existing methods fail to capture part-specific motion details for precise alignment.

WaMo addresses both issues with wavelet-based multi-frequency feature extraction for fine-grained text-motion retrieval.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavelet-based multi-frequency feature extraction

Part-specific time-varying motion details

Disordered motion sequence prediction
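The disordered motion sequence prediction idea can be sketched as a self-supervised data step: split a motion into segments, shuffle them, and keep the permutation as the prediction target. The equal-length segmentation, the function names, and the `(T, D)` feature layout below are assumptions for illustration; the source does not specify how WaMo segments or scores the sequences.

```python
import numpy as np

def shuffle_segments(motion, num_segments, rng):
    """Split motion (T, D) into equal segments and shuffle them.

    Returns the shuffled motion and the permutation, where
    shuffled segment i is original segment perm[i]. Frames that do
    not fill a whole segment are dropped.
    """
    seg_len = motion.shape[0] // num_segments
    usable = seg_len * num_segments
    segments = motion[:usable].reshape(num_segments, seg_len, *motion.shape[1:])
    perm = rng.permutation(num_segments)
    shuffled = segments[perm].reshape(usable, *motion.shape[1:])
    return shuffled, perm

def restore_order(shuffled, perm):
    """Undo shuffle_segments given the (predicted or true) permutation."""
    num_segments = len(perm)
    seg_len = shuffled.shape[0] // num_segments
    segments = shuffled.reshape(num_segments, seg_len, *shuffled.shape[1:])
    restored = np.empty_like(segments)
    restored[perm] = segments  # place each segment back at its original slot
    return restored.reshape(shuffled.shape)

# Hypothetical input: 12 frames of 2-dimensional features, 4 segments.
motion = np.arange(24, dtype=float).reshape(12, 2)
shuffled, perm = shuffle_segments(motion, 4, np.random.default_rng(0))
restored = restore_order(shuffled, perm)
```

A model trained on such pairs would take `shuffled` as input and predict `perm`, so the labels come for free from the data itself, which matches the claim that no additional supervision is required.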
