M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation

📅 2025-08-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal RAG (MRAG) methods accept multimodal inputs but produce only single-modality output, which limits multimodal interaction and complex reasoning. To address this, the paper proposes M2IO-R1, a framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that applies reinforcement learning to controllable image insertion in multimodal generation. Its core component is a lightweight 3B-parameter inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization (GRPO) to guide image selection and placement so that inserted images stay semantically aligned with the surrounding text. The framework integrates multimodal retrieval, reasoning, and generation end to end, supporting both multimodal input and multimodal output. According to the reported experiments, Inserter-R1-3B outperforms baselines in generation quality, inference steps, and latency, which supports its use for efficient, controllable multimodal content generation.

📝 Abstract

Current research on Multimodal Retrieval-Augmented Generation (MRAG) enables diverse multimodal inputs but remains limited to single-modality outputs, restricting expressive capacity and practical utility. In contrast, real-world applications often demand both multimodal inputs and multimodal outputs for effective communication and grounded reasoning. Motivated by the recent success of Reinforcement Learning (RL) in complex reasoning tasks for Large Language Models (LLMs), we adopt RL as a principled and effective paradigm to address the multi-step, outcome-driven challenges inherent in multimodal output generation. Here, we introduce M2IO-R1, a novel framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that supports both multimodal inputs and outputs. Central to our framework is an RL-based inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization to guide image selection and placement in a controllable and semantically aligned manner. Empirical results show that our lightweight 3B inserter achieves strong reasoning capabilities with significantly reduced latency, outperforming baselines in both quality and efficiency.
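The abstract names Group Relative Policy Optimization but does not spell out its mechanics. As a rough illustration, GRPO as commonly described samples a group of outputs for the same prompt, scores each one, and normalizes every reward against the group's mean and standard deviation instead of using a learned value baseline. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, and the reward values are invented; none of this is taken from the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar score per sampled
    output for a single prompt. Population std and `eps` are assumptions;
    implementations differ on both.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented rewards for four sampled insertion plans of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that score above the group average receive a positive advantage and those below receive a negative one, so the policy update favors the better plans within each group.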

Problem

Research questions and friction points this paper is trying to address.

Enabling multimodal outputs for MRAG to enhance expressive capacity

Addressing multi-step challenges in multimodal output generation using RL

Improving image selection and placement with controllable semantic alignment
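To make the image selection and placement problem listed above more concrete, the following minimal sketch treats an inserter's decision as a list of (paragraph index, image id) pairs and applies it to a text answer. The plan format, the function name `apply_insertions`, and the `<img:...>` placeholder syntax are all hypothetical, chosen only for illustration; the paper's actual output format is not described in this summary.

```python
def apply_insertions(paragraphs, placements):
    """Interleave image placeholders with text paragraphs.

    Hypothetical format: each placement is (paragraph_index, image_id),
    meaning the image is placed directly after that paragraph.
    """
    by_paragraph = {}
    for index, image_id in placements:
        if not 0 <= index < len(paragraphs):
            raise ValueError(f"paragraph index out of range: {index}")
        by_paragraph.setdefault(index, []).append(image_id)

    output = []
    for i, paragraph in enumerate(paragraphs):
        output.append(paragraph)
        # Images chosen for this paragraph follow it in plan order.
        for image_id in by_paragraph.get(i, []):
            output.append(f"<img:{image_id}>")
    return output


# Invented example: two paragraphs, one selected image after the first.
document = apply_insertions(
    ["Preheat the oven.", "Bake for 20 minutes."],
    [(0, "oven_dial")],
)
```

Selection corresponds to which image ids appear in the plan, and placement corresponds to the paragraph index paired with each one.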

Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-enhanced multimodal generation framework

Group Relative Policy Optimization training

Lightweight 3B inserter for efficiency

🔎 Similar Papers

UniRAG: Universal Retrieval Augmentation for Large Vision Language Models