🤖 AI Summary
Existing motion retrieval methods suffer from unintuitive user interaction and insufficient modeling of temporal alignment across multimodal sequences. To address these limitations, we propose the first fine-grained joint embedding framework integrating text, audio, video, and motion—introducing audio as a novel modality to enhance interactivity, immersion, and accessibility. We design a sequence-level contrastive learning mechanism that explicitly models cross-modal temporal correspondences, enabling more precise multimodal alignment. Furthermore, we incorporate synthetic audio data augmentation and optimize a unified similarity metric. Evaluated on the HumanML3D benchmark, our method achieves +10.16% in text-to-motion Recall@10 and +25.43% in video-to-motion Recall@1. The four-modal approach significantly outperforms three-modal baselines, empirically validating the critical roles of audio integration and explicit temporal sequence modeling in multimodal motion retrieval.
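The summary does not spell out the loss function, but a minimal sketch of one plausible sequence-level (per-timestep) contrastive objective between two temporally aligned modalities may help make the idea concrete. The function and variable names (`sequence_contrastive_loss`, `seq_a`, `seq_b`) are illustrative assumptions, not the paper's own formulation:

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(seq_a, seq_b, temperature=0.07):
    """InfoNCE-style loss over two temporally aligned embedding sequences.

    seq_a, seq_b: (batch, time, dim) per-timestep embeddings of two modalities
    (e.g. motion and audio), assumed to be aligned frame by frame; matching
    timesteps of matching samples are treated as positive pairs.
    """
    b, t, d = seq_a.shape
    # Normalize so that dot products are cosine similarities.
    a = F.normalize(seq_a.reshape(b * t, d), dim=-1)
    v = F.normalize(seq_b.reshape(b * t, d), dim=-1)
    # Similarity of every timestep in modality A against every timestep in B.
    logits = a @ v.t() / temperature                      # (b*t, b*t)
    # The positive for row i is column i (same sample, same frame).
    targets = torch.arange(b * t, device=seq_a.device)
    # Symmetric cross-entropy covers both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Treating every aligned timestep as a positive pair is one way to make the embedding space fine-grained; the paper's actual mechanism may weight, pool, or align timesteps differently.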
📝 Abstract
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modalities. However, these methods lack an intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities, which could improve retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
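For reference, the R@1 and R@10 figures above are standard retrieval recall over the joint embedding space. A minimal sketch of how such metrics are typically computed follows; the embedding names (`text_emb`, `video_emb`, `motion_emb`) are placeholders, not variables from the paper:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose ground-truth match (the gallery item with the
    same index) appears among the k most similar gallery embeddings."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                                  # (num_queries, k)
    targets = torch.arange(query_emb.size(0), device=query_emb.device)   # (num_queries,)
    return (topk == targets.unsqueeze(1)).any(dim=-1).float().mean().item()

# Example usage (illustrative):
# r10 = recall_at_k(text_emb, motion_emb, k=10)   # text-to-motion R@10
# r1  = recall_at_k(video_emb, motion_emb, k=1)   # video-to-motion R@1
```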