MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Insufficient motion realism in image-to-video generation stems primarily from the difficulty of universally modeling physical constraints, object interactions, and domain-specific dynamics. To address this, we propose a retrieval-augmented, context-aware motion transfer framework. Our method employs a video encoder and dedicated resamplers to extract high-level motion priors, leverages a causal transformer for in-context motion adaptation, and introduces an attention-based motion adapter that injects retrieved reference motion into a pretrained video diffusion model. The framework enables zero-shot cross-domain generalization: its motion database is modular and plug-and-play, so new domains require only a database update with no retraining or fine-tuning. Extensive experiments demonstrate significant improvements in motion realism across multiple base models and diverse scenarios, with negligible inference overhead, achieving both high efficiency and strong generalization.

📝 Abstract
Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using a video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
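The retrieval step of the pipeline amounts to a nearest-neighbor lookup over a database of motion features keyed by appearance or semantic features of the query image. The sketch below is purely illustrative: the function and variable names are ours, and in the actual system the features would come from the paper's video encoder and resamplers rather than raw vectors.

```python
import numpy as np

def retrieve_motion_priors(query_feat, motion_db, k=2):
    """Return the k motion-prior entries whose keys are most
    cosine-similar to the query image's feature vector.

    motion_db is a list of (key_vector, motion_prior) pairs;
    both the keying scheme and k are illustrative assumptions."""
    keys = np.stack([key for key, _ in motion_db])
    q = query_feat / np.linalg.norm(query_feat)
    sims = keys @ q / np.linalg.norm(keys, axis=1)  # cosine similarity
    top = np.argsort(-sims)[:k]                     # indices of best matches
    return [motion_db[i][1] for i in top]
```

Because retrieval is decoupled from the generator, swapping or extending `motion_db` changes the available motion priors without touching any trained weights, which is what makes the zero-shot domain transfer claim plausible.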
Problem

Research questions and friction points this paper is trying to address.

Generating videos with realistic motion from images
Modeling complex physical constraints and object interactions
Generalizing motion dynamics across diverse scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-based pipeline extracts semantic motion features
Causal transformer enables in-context motion adaptation learning
Attention-based adapter injects motion into diffusion models
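The last innovation, attention-based motion injection, can be illustrated as a single-head cross-attention where the diffusion model's hidden tokens attend to the transferred motion features. Everything here is a toy sketch under our own assumptions (names, single head, and the residual `scale` gate are not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_motion(hidden, motion_feats, Wq, Wk, Wv, scale=1.0):
    """Cross-attention: queries come from the backbone's hidden tokens,
    keys/values from retrieved motion features. The result is added
    residually, so with scale == 0 the pretrained model is unchanged."""
    q = hidden @ Wq        # (n_tokens, d)
    k = motion_feats @ Wk  # (n_motion, d)
    v = motion_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return hidden + scale * (attn @ v)
```

The residual form is a common adapter design choice: only the new projection weights need training, and the injection can be disabled at inference without altering the frozen diffusion backbone.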
Chenhui Zhu
Lawrence Berkeley National Lab
Soft and Functional Materials; Synchrotron X-Ray Science
Yilu Wu
Nanjing University
Computer Vision
Shuai Wang
State Key Laboratory for Novel Software Technology, Nanjing University
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Laboratory