Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing motion retrieval methods suffer from unintuitive user interaction and insufficient modeling of temporal alignment across multimodal sequences. To address these limitations, we propose the first fine-grained joint embedding framework integrating text, audio, video, and motion—introducing audio as a novel modality to enhance interactivity, immersion, and accessibility. We design a sequence-level contrastive learning mechanism that explicitly models cross-modal temporal correspondences, enabling more precise multimodal alignment. Furthermore, we incorporate synthetic audio data augmentation and optimize a unified similarity metric. Evaluated on the HumanML3D benchmark, our method achieves +10.16% in text-to-motion Recall@10 and +25.43% in video-to-motion Recall@1. The four-modal approach significantly outperforms three-modal baselines, empirically validating the critical roles of audio integration and explicit temporal sequence modeling in multimodal motion retrieval.
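The summary names a sequence-level contrastive mechanism, but the listing includes no code. As a rough illustration only, the PyTorch sketch below computes a symmetric InfoNCE loss between two modalities after temporal pooling; the encoder inputs, the mean-pooling choice, and the temperature are assumptions of this sketch, not the authors' implementation (which, per the abstract, models temporal correspondence in a more fine-grained way).

```python
# Minimal sketch of a symmetric InfoNCE (contrastive) loss between two
# modalities, assuming each encoder emits one embedding per time step.
# All shapes and the pooling choice are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(seq_a, seq_b, temperature=0.07):
    """seq_a, seq_b: (batch, time, dim) per-frame embeddings of two
    modalities for the same batch of paired sequences."""
    # Temporal mean pooling collapses each sequence to one vector; the
    # paper's fine-grained alignment is presumably more elaborate.
    emb_a = F.normalize(seq_a.mean(dim=1), dim=-1)  # (batch, dim)
    emb_b = F.normalize(seq_b.mean(dim=1), dim=-1)  # (batch, dim)

    # Cosine-similarity logits between every pair in the batch.
    logits = emb_a @ emb_b.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: matched pairs sit on the diagonal.
    loss_ab = F.cross_entropy(logits, targets)
    loss_ba = F.cross_entropy(logits.t(), targets)
    return (loss_ab + loss_ba) / 2
```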

📝 Abstract
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from the text or visual modality. However, these methods lack an intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities, which limits retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
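For readers unfamiliar with the reported metrics: R@K (Recall@K) is the fraction of queries whose ground-truth match ranks among the top K retrieved results by embedding similarity. A minimal sketch, assuming L2-normalized embeddings where gallery index i is the ground truth for query i:

```python
import torch

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose paired gallery item (same index)
    appears among the k nearest neighbours by cosine similarity.
    Assumes both embedding matrices are already L2-normalized."""
    sims = query_emb @ gallery_emb.t()                        # (n, n)
    topk = sims.topk(k, dim=1).indices                        # (n, k)
    truth = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (topk == truth).any(dim=1).float().mean().item()
```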
Problem

Research questions and friction points this paper is trying to address.

Existing retrieval interfaces lack an intuitive, user-friendly interaction mode
Prior methods overlook the sequential representation of most modalities
Temporal alignment across multimodal sequences is insufficiently modeled
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns text, audio, video, and motion in a fine-grained joint embedding space
Uses sequence-level contrastive learning for cross-modal alignment
Augments text-motion datasets with synthetic, diverse audio (see the sketch after this list)
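The listing does not say how the synthetic audio is produced. One plausible reading, purely an assumption here, is that the text captions are rendered to speech with varied voices and rates; the sketch below uses the off-the-shelf pyttsx3 TTS engine to illustrate that idea, not the paper's actual pipeline.

```python
# Hypothetical illustration only: render each caption to speech with
# varied voices and speaking rates to obtain "synthetic but diverse"
# audio. The paper does not specify its synthesis method.
import os
import pyttsx3

def synthesize_captions(captions, out_dir="audio"):
    os.makedirs(out_dir, exist_ok=True)
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")
    for i, text in enumerate(captions):
        # Cycle through installed voices and vary the rate so the
        # synthetic recordings are not acoustically uniform.
        engine.setProperty("voice", voices[i % len(voices)].id)
        engine.setProperty("rate", 140 + 20 * (i % 3))
        engine.save_to_file(text, f"{out_dir}/caption_{i}.wav")
    engine.runAndWait()  # flush all queued utterances to disk
```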
Shiyao Yu
Southern University of Science and Technology, and jointly with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Zi-An Wang
University of Chinese Academy of Sciences, and jointly with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Kangning Yin
Shanghai Jiao Tong University
Robotics, Humanoid, Embodied AI
Zheng Tian
University College London
Reinforcement Learning, Multi-Agent Reinforcement Learning, Machine Learning
Mingyuan Zhang
Nanyang Technological University, Singapore
Weixin Si
Shenzhen University of Advanced Technology
Mixed Reality, Physically Based Modeling, Medical Data Analysis
Shihao Zou
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Computer Vision