🤖 AI Summary
To address the limited out-of-domain generalization and poor out-of-vocabulary action generation that motion large language models (motion LLMs) suffer due to scarce annotated data, this paper proposes VimoRAG, a video retrieval-augmented framework for 3D motion generation. Methodologically, it introduces a Gemini-driven motion-semantic video retrieval mechanism for precise, action-centric matching of 2D motion signals, and a motion-centric dual-alignment DPO trainer that explicitly models the cross-modal mapping from 2D pose to 3D motion while suppressing retrieval-induced errors. The framework integrates four key modules: video action retrieval, 2D pose extraction, 3D motion generation, and reinforcement learning–based alignment. Experiments demonstrate that VimoRAG significantly improves text-to-motion generation quality under purely textual prompting, achieving state-of-the-art performance both on multiple objective metrics (including Joint Position Error, Mean Per-Joint Position Error, and Fréchet Inception Distance) and in human evaluations.
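The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustrative mock, not the paper's actual implementation: all function names, the toy word-overlap retriever, and the zero-depth "2D-to-3D lifting" stub are assumptions standing in for the real Gemini Motion Video Retriever and motion LLM.

```python
# Hypothetical sketch of a VimoRAG-style pipeline (retrieval -> 2D pose -> 3D motion).
# Function names, data shapes, and scoring are illustrative assumptions only.

def retrieve_video(text_prompt, video_db):
    """Stage 1 (assumed): pick the video whose action label best matches the prompt.
    Toy scoring: count words shared between the prompt and each video's label."""
    def score(video):
        return len(set(text_prompt.lower().split()) & set(video["label"].lower().split()))
    return max(video_db, key=score)

def extract_2d_pose(video):
    """Stage 2 (assumed): per-frame 2D joint coordinates from the retrieved video."""
    return video["pose_2d"]

def generate_3d_motion(text_prompt, pose_2d):
    """Stage 3 (assumed): a motion LLM would condition on text + 2D pose;
    here a stub lifts each 2D joint to 3D with a zero depth component."""
    return [[(x, y, 0.0) for (x, y) in frame] for frame in pose_2d]

# Toy in-the-wild video database: action label + per-frame 2D joints.
video_db = [
    {"label": "person waving hand", "pose_2d": [[(0.1, 0.9), (0.2, 0.8)]]},
    {"label": "person kicking ball", "pose_2d": [[(0.5, 0.4), (0.6, 0.3)]]},
]

video = retrieve_video("a man waving his hand", video_db)
motion_3d = generate_3d_motion("a man waving his hand", extract_2d_pose(video))
print(video["label"])   # person waving hand
print(motion_3d[0][0])  # (0.1, 0.9, 0.0)
```

The fourth module, the motion-centric dual-alignment DPO trainer, would sit on top of this pipeline as a preference-optimization stage that penalizes motions degraded by poor retrievals; it is omitted here since it is a training-time procedure rather than an inference step.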
📝 Abstract
This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). Because motion LLMs face severe out-of-domain and out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. Video-based motion RAG is nontrivial, and we address its two key bottlenecks: (1) building an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating error propagation caused by suboptimal retrieval results. To this end, we design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.