🤖 AI Summary
Existing RAG methods for video content creation suffer from two key limitations: (1) retrieval lacks query-driven adaptivity, and (2) videos are reduced to static textual descriptions, failing to model their intrinsic temporal visual structure. This work introduces VideoRAG, the first end-to-end Video Retrieval-Augmented Generation framework, enabling query-adaptive dynamic video retrieval and jointly modeling frame sequences and temporal text to enhance factual consistency in generation. The approach builds upon Large Video-Language Models (LVLMs), integrating frame-level encoding, temporal attention mechanisms, and instruction tuning to support cross-modal retrieval and the direct injection of retrieved video content into the generative process. Evaluated on multi-task video question answering and factual video captioning benchmarks, VideoRAG achieves a 12.7% absolute improvement in factual accuracy over text- and image-based RAG baselines. To the authors' knowledge, this is the first work to empirically validate the necessity and efficacy of leveraging video, not just as a static modality but as a dynamic knowledge source, in retrieval-augmented generation.
📝 Abstract
Retrieval-Augmented Generation (RAG) is a powerful strategy for addressing the problem of factually incorrect outputs from foundation models: it retrieves external knowledge relevant to queries and incorporates it into the generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos into the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these limitations, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both the visual and textual information of videos during output generation. To operationalize this, our method builds on recent advances in Large Video Language Models (LVLMs), which enable direct processing of video content to represent it for retrieval, as well as the seamless integration of retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showing that it outperforms relevant baselines.
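The query-adaptive retrieval step described above can be sketched as a similarity search over video embeddings that blend visual (frame-level) and textual features. The sketch below is a minimal illustration, not the paper's actual implementation: the blending weight `alpha`, the mean-pooling of frames, and all function names are assumptions made for clarity.

```python
# Hypothetical sketch of query-adaptive video retrieval: each candidate
# video is scored against the query by cosine similarity, where a video
# embedding blends pooled visual frame features with a textual feature.
# `alpha` and the pooling scheme are illustrative assumptions only.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def video_embedding(frame_feats: np.ndarray, text_feat: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Blend mean-pooled frame features with the video's textual feature."""
    visual = frame_feats.mean(axis=0)  # pool over the frame axis -> (d,)
    return alpha * visual + (1 - alpha) * text_feat

def retrieve_top_k(query_emb: np.ndarray, corpus: list[np.ndarray],
                   k: int = 2) -> list[int]:
    """Return indices of the k corpus videos most similar to the query."""
    scores = [cosine(query_emb, v) for v in corpus]
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension
corpus = [video_embedding(rng.normal(size=(16, d)), rng.normal(size=d))
          for _ in range(5)]
# A query almost identical to video 3 should rank it first.
query = corpus[3] + 0.01 * rng.normal(size=d)
top = retrieve_top_k(query, corpus)
```

In the full framework, the retrieved videos' frames and associated text would then be passed to the LVLM alongside the query rather than discarded after scoring.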