Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation

📅 2025-08-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal RAG (mRAG) methods suffer from two key limitations in real-world applications such as news analysis: inflexible retrieval strategies and insufficient use of visual information. To address these, this paper proposes E-Agent, a framework built on a planner-executor architecture. The planner produces a single, one-shot retrieval plan per query, choosing multimodal tools from context and avoiding redundant LLM invocations. The executor turns that plan into a tool-aware execution sequence that explicitly models dependencies between retrieval steps. To evaluate dynamic decision-making, the authors introduce RemPlan, a new benchmark tailored to realistic scenarios. Experiments report that E-Agent achieves an average 13% accuracy gain over state-of-the-art methods across RemPlan and three established benchmarks, while reducing redundant searches by 37%.

📝 Abstract

Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: an mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, each systematically annotated with the retrieval tools its instances require. The benchmark's explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent's superiority: a 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
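
The abstract describes benchmark instances annotated with the retrieval tools they require, and reports a reduction in redundant searches. The sketch below illustrates one way such an annotation could be used to count redundant tool calls. It is not the paper's metric or data format: the `PlanInstance` fields, the tool names, and the definition of "redundant" (any call to a tool outside the annotated set) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanInstance:
    """Hypothetical benchmark record; field names are assumptions."""
    question: str
    required_tools: frozenset  # empty for retrieval-independent questions


def redundant_calls(instance: PlanInstance, called_tools: list) -> int:
    """Count calls to tools that the annotation does not require."""
    return sum(1 for tool in called_tools if tool not in instance.required_tools)


def redundant_rate(instances: list, traces: list) -> float:
    """Share of all tool calls that were redundant under the assumption above."""
    total = sum(len(trace) for trace in traces)
    if total == 0:
        return 0.0
    wasted = sum(redundant_calls(i, t) for i, t in zip(instances, traces))
    return wasted / total
```

Under this definition, a retrieval-independent question has an empty `required_tools` set, so every search issued for it counts as redundant.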

Problem

Research questions and friction points this paper is trying to address.

Overcoming rigid retrieval strategies in multimodal RAG systems

Addressing under-utilization of visual information in mRAG

Minimizing redundant tool invocations in multimodal planning workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic multimodal tool orchestration through contextual reasoning

Tool-aware execution sequencing for optimized workflows

One-time planning minimizing redundant tool invocations
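
The three ideas above (plan once, pick tools from context, order calls by their dependencies) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ToolCall` structure, the tool names, and the use of a topological sort to derive the execution sequence are all hypothetical choices. In the paper the plan comes from a trained planner; here it is simply passed in as data.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ToolCall:
    """One step of a hypothetical one-shot plan."""
    name: str
    depends_on: list = field(default_factory=list)


def execution_order(plan: list) -> list:
    """Order tool calls so every call runs after the calls it depends on."""
    names = {call.name for call in plan}
    for call in plan:
        missing = set(call.depends_on) - names
        if missing:
            raise ValueError(f"{call.name} depends on unplanned tools: {missing}")
    graph = {call.name: set(call.depends_on) for call in plan}
    return list(TopologicalSorter(graph).static_order())


def run_plan(plan: list, tools: dict, query: str) -> dict:
    """Execute a fixed plan once; no re-planning between tool calls."""
    by_name = {call.name: call for call in plan}
    results = {}
    for name in execution_order(plan):
        upstream = {dep: results[dep] for dep in by_name[name].depends_on}
        results[name] = tools[name](query, upstream)
    return results
```

An empty plan corresponds to a retrieval-independent question: `run_plan` then calls no tools and returns an empty dictionary, so the model would answer directly.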

🔎 Similar Papers

UniRAG: Universal Retrieval Augmentation for Large Vision Language Models