🤖 AI Summary
Existing multimodal RAG (MRAG) methods accept multimodal input but produce only single-modality output, which limits multimodal interaction and complex reasoning. To address this, we propose M2IO-R1, a framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that applies reinforcement learning to controllable image insertion. Its core component is a lightweight 3B-parameter inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization (GRPO) to guide image selection and placement so that inserted images stay semantically aligned with the surrounding text. The framework supports both multimodal input and multimodal output. Experiments show that the 3B inserter outperforms baselines in both generation quality and efficiency, with significantly reduced latency.
📝 Abstract
Current research on Multimodal Retrieval-Augmented Generation (MRAG) enables diverse multimodal inputs but remains limited to single-modality outputs, restricting expressive capacity and practical utility. In contrast, real-world applications often demand both multimodal inputs and multimodal outputs for effective communication and grounded reasoning. Motivated by the recent success of Reinforcement Learning (RL) in complex reasoning tasks for Large Language Models (LLMs), we adopt RL as a principled and effective paradigm to address the multi-step, outcome-driven challenges inherent in multimodal output generation. Here, we introduce M2IO-R1, a novel framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that supports both multimodal inputs and outputs. Central to our framework is an RL-based inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization to guide image selection and placement in a controllable and semantically aligned manner. Empirical results show that our lightweight 3B inserter achieves strong reasoning capabilities with significantly reduced latency, outperforming baselines in both quality and efficiency.
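As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the rewards of several sampled outputs for one query against their own group mean and standard deviation. It is a minimal sketch, not the authors' training code: the function name, the reward values, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration, and the abstract does not specify how the inserter's reward is defined.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own sampling group.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation (population std is an assumption here). `eps` keeps
    the division defined when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled image-placement plans for a single query.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Plans that score above the group average receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.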