RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism

📅 2025-04-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video generation methods often fall short in motion complexity and physical plausibility, with outputs that appear static or exhibit unrealistic movement. To address this, the authors propose RAGME, a retrieval-augmented framework that incorporates a retrieval mechanism into the generation phase of video diffusion models. Retrieved videos act as grounding signals, providing the model with demonstrations of how objects move. The pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. Evaluated on established metrics and recently proposed benchmarks, RAGME improves the motion realism of generated videos, and the authors highlight additional applications of the framework.

๐Ÿ“ Abstract
Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.
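The retrieval step the abstract describes can be sketched as a nearest-neighbor lookup in a shared text-video embedding space: the prompt is embedded, the most similar reference clips are retrieved, and their features are then used to condition the pretrained diffusion model. The sketch below illustrates only the retrieval part; all names (`retrieve_references`, the toy embedding database) are hypothetical and do not come from the paper, which does not specify its retrieval implementation at this level of detail.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector (1, d) and database rows (n, d)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve_references(prompt_emb: np.ndarray, index_embs: np.ndarray, k: int = 2):
    """Return indices of the k database clips most similar to the prompt.

    In a RAGME-style pipeline these clips would then serve as grounding
    signals that condition the text-to-video diffusion model.
    """
    sims = cosine_similarity(prompt_emb[None, :], index_embs)[0]
    return np.argsort(-sims)[:k]

# Toy database: 100 precomputed clip embeddings in a shared (e.g. CLIP-like)
# text-video space. A prompt embedding close to clip 7 should retrieve it first.
rng = np.random.default_rng(0)
index_embs = rng.standard_normal((100, 64))
prompt_emb = index_embs[7] + 0.01 * rng.standard_normal(64)

top = retrieve_references(prompt_emb, index_embs, k=3)
print(top[0])  # the near-duplicate clip (index 7) ranks first
```

The retrieved clips would then be encoded and injected as additional conditioning alongside the text prompt, which is the "minimal fine-tuning" design choice the abstract emphasizes: the base model stays frozen or nearly so, and only the conditioning pathway adapts.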
Problem

Research questions and friction points this paper is trying to address.

Enhancing motion realism in generated videos
Addressing unrealistic motion in video generation
Incorporating retrieval mechanisms for better motion demonstration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval mechanism enhances motion realism
Pretrained model conditioned on retrieved samples
Minimal fine-tuning for improved video generation
🔎 Similar Papers
No similar papers found.