🤖 AI Summary
Existing motion retrieval methods suffer from unintuitive user interaction and insufficient modeling of temporal alignment across multimodal sequences. To address these limitations, we propose the first fine-grained joint embedding framework integrating text, audio, video, and motion—introducing audio as a novel modality to enhance interactivity, immersion, and accessibility. We design a sequence-level contrastive learning mechanism that explicitly models cross-modal temporal correspondences, enabling more precise multimodal alignment. Furthermore, we incorporate synthetic audio data augmentation and optimize a unified similarity metric. Evaluated on the HumanML3D benchmark, our method achieves +10.16% in text-to-motion Recall@10 and +25.43% in video-to-motion Recall@1. The four-modal approach significantly outperforms three-modal baselines, empirically validating the critical roles of audio integration and explicit temporal sequence modeling in multimodal motion retrieval.
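The summary does not spell out the loss function, but a minimal sketch of one plausible sequence-level (per-timestep) contrastive objective between two temporally aligned modalities may help make the idea concrete. The function and variable names (`sequence_contrastive_loss`, `seq_a`, `seq_b`) are illustrative assumptions, not the paper's own formulation:

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(seq_a, seq_b, temperature=0.07):
    """InfoNCE-style loss over two temporally aligned embedding sequences.

    seq_a, seq_b: (batch, time, dim) per-timestep embeddings of two modalities
    (e.g. motion and audio), assumed to be aligned frame by frame; matching
    timesteps of matching samples are treated as positive pairs.
    """
    b, t, d = seq_a.shape
    # Normalize so that dot products are cosine similarities.
    a = F.normalize(seq_a.reshape(b * t, d), dim=-1)
    v = F.normalize(seq_b.reshape(b * t, d), dim=-1)
    # Similarity of every timestep in modality A against every timestep in B.
    logits = a @ v.t() / temperature                      # (b*t, b*t)
    # The positive for row i is column i (same sample, same frame).
    targets = torch.arange(b * t, device=seq_a.device)
    # Symmetric cross-entropy covers both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Treating every aligned timestep as a positive pair is one way to make the embedding space fine-grained; the paper's actual mechanism may weight, pool, or align timesteps differently.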
📝 Abstract
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modalities. However, these methods lack an intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities, which could improve retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
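For reference, the R@1 and R@10 figures above are standard retrieval recall over the joint embedding space. A minimal sketch of how such metrics are typically computed follows; the embedding names (`text_emb`, `video_emb`, `motion_emb`) are placeholders, not variables from the paper:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose ground-truth match (the gallery item with the
    same index) appears among the k most similar gallery embeddings."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                                  # (num_queries, k)
    targets = torch.arange(query_emb.size(0), device=query_emb.device)   # (num_queries,)
    return (topk == targets.unsqueeze(1)).any(dim=-1).float().mean().item()

# Example usage (illustrative):
# r10 = recall_at_k(text_emb, motion_emb, k=10)   # text-to-motion R@10
# r1  = recall_at_k(video_emb, motion_emb, k=1)   # video-to-motion R@1
```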