RAG-Anything: All-in-One RAG Framework

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing RAG frameworks are largely confined to the text modality and struggle to retrieve and reason over real-world documents containing multimodal elements such as images, tables, and mathematical formulas. This work introduces RAG-Anything, a RAG framework supporting unified retrieval and generation across all modalities (text, image, table, formula). Its core innovation is a dual-graph architecture: a structural graph modeling document layout and entity relations, and a semantic graph encoding cross-modal semantics, combined with a hybrid retrieval mechanism that integrates structural navigation with semantic matching. By representing multimodal entities explicitly and aligning them in a unified embedding space, RAG-Anything enables fine-grained knowledge navigation and cross-modal semantic matching. It achieves significant improvements over state-of-the-art methods on multimodal benchmarks, especially in long-document scenarios. The code is publicly released, advancing RAG toward a truly multimodal paradigm.

πŸ“ Abstract
Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: https://github.com/HKUDS/RAG-Anything.
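The dual-graph construction described in the abstract can be sketched in a few lines of code. This is an illustrative sketch under assumptions of my own, not the released API: the `Entity` class, `build_dual_graph`, the similarity threshold, and the toy 2-d embeddings are all hypothetical. The idea it demonstrates is the one the abstract states: layout relations populate a structural graph, while embedding similarity in a unified space populates a semantic graph over the same multimodal entities.

```python
from dataclasses import dataclass

# Hypothetical sketch: multimodal elements (text chunks, images, tables,
# formulas) become graph nodes. A structural graph records layout/containment
# edges; a semantic graph records embedding-similarity edges.

@dataclass
class Entity:
    eid: str
    modality: str      # "text" | "image" | "table" | "formula"
    embedding: list    # vector in a unified embedding space (toy 2-d here)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_dual_graph(entities, layout_edges, sim_threshold=0.8):
    """Return (structural_graph, semantic_graph) as adjacency dicts."""
    # Structural graph: undirected edges from document layout,
    # e.g. ("sec1", "fig1") means section sec1 contains figure fig1.
    structural = {e.eid: set() for e in entities}
    for parent, child in layout_edges:
        structural[parent].add(child)
        structural[child].add(parent)
    # Semantic graph: link entities whose embeddings are close,
    # regardless of modality (cross-modal edges arise naturally).
    semantic = {e.eid: set() for e in entities}
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if cosine(a.embedding, b.embedding) >= sim_threshold:
                semantic[a.eid].add(b.eid)
                semantic[b.eid].add(a.eid)
    return structural, semantic
```

With three toy entities, a section linked by layout to a figure and a table ends up with both in its structural neighborhood, but only the semantically similar figure in its semantic neighborhood.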
Problem

Research questions and friction points this paper is trying to address.

Addresses misalignment between RAG capabilities and multimodal information environments
Enables unified knowledge retrieval across text, images, tables, and math
Solves fragmented processing of interconnected multimodal content in documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for multimodal knowledge retrieval
Dual-graph construction capturing cross-modal relationships
Cross-modal hybrid retrieval combining structural and semantic matching
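The hybrid retrieval idea in the last bullet can be sketched as a score fusion: semantic matching scores every entity against the query embedding, and structural navigation boosts entities whose graph neighbors also match. This is a minimal sketch under my own assumptions; the fusion weight `alpha`, the one-hop neighbor bonus, and the function names are illustrative choices, not the authors' exact mechanism.

```python
# Hypothetical sketch of cross-modal hybrid retrieval: fuse a direct
# semantic-similarity score with a structural bonus from graph neighbors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query, embeddings, graph, k=2, alpha=0.5):
    """embeddings: {entity_id: vector}; graph: {entity_id: set of neighbors}.

    Score = alpha * semantic similarity to the query
          + (1 - alpha) * best similarity among graph neighbors.
    """
    sem = {eid: cosine(query, vec) for eid, vec in embeddings.items()}
    fused = {}
    for eid in embeddings:
        # Structural navigation: an entity inherits relevance from the
        # best-matching entity it is connected to in the graph.
        bonus = max((sem[n] for n in graph.get(eid, ())), default=0.0)
        fused[eid] = alpha * sem[eid] + (1 - alpha) * bonus
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

In a toy run, a table with zero direct similarity to the query is still retrieved because its containing section matches strongly, while an unconnected, moderately similar figure is ranked below it. That is the behavior the bullet describes: evidence reachable through structure, not just through embedding proximity.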