MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MMRAG methods do not model the reasoning behind retrieval and generation in an interpretable way, which undermines the credibility of their results. This paper proposes a two-stage reinforcement fine-tuning framework: Stage I applies rule-driven optimization to retrieval ranking; Stage II applies reasoning-driven joint optimization to retrieval and generation, explicitly outputting structured reasoning chains. By coupling rule-based point-wise ranking with reasoning-based list-wise ranking, the framework enables end-to-end interpretable joint optimization of multimodal retrieval and answer generation. Evaluated on WebQA and MultimodalQA, it achieves state-of-the-art performance, and ablation studies confirm that each module contributes to both interpretability and accuracy.

📝 Abstract
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
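As a rough illustration of the first stage, a rule-based point-wise reward could combine a format check (did the model emit a reasoning block and a relevance label?) with label accuracy against a gold annotation. The tag scheme, label vocabulary, and reward weights below are assumptions for the sketch, not the paper's specification:

```python
import re

# Hypothetical Stage-I reward for rule-based point-wise ranking (a sketch).
# The policy judges one candidate document, emitting its reasoning inside
# <think>...</think> and a relevance label inside <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.*</think>\s*<answer>(relevant|irrelevant)</answer>$",
    re.DOTALL,
)

def pointwise_reward(completion: str, gold_label: str) -> float:
    """Rule-based reward: +0.5 for well-formed output, +1.0 for a correct label."""
    m = FORMAT_RE.match(completion.strip())
    if m is None:
        return 0.0          # malformed output earns nothing
    reward = 0.5            # format reward
    if m.group(1) == gold_label:
        reward += 1.0       # accuracy reward
    return reward
```

Rewards of this shape can be computed without a learned judge, which is what makes the first stage cheap enough to run over many candidate documents and filter out the clearly irrelevant ones.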
Problem

Research questions and friction points this paper is trying to address.

Enhances reasoning for explainable multi-modal retrieval-augmented generation
Improves document ranking and answer generation via two-stage reinforcement fine-tuning
Addresses lack of explainability in existing multi-modal retrieval-augmented methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reinforcement fine-tuning for explainable MMRAG
Rule-based coarse ranking filters irrelevant multimodal documents
Reasoning-based fine ranking jointly optimizes retrieval and generation
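For the second stage, one way to picture the joint objective is a reward that sums a list-wise ranking term (here NDCG over the predicted document order) with an answer-correctness term. The weights, the NDCG choice, and the exact-match criterion are assumptions for illustration, not the paper's actual reward design:

```python
import math

def ndcg(ranked_ids, relevant_ids):
    """NDCG of a predicted ranking against a set of relevant document ids."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), len(ranked_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def joint_reward(ranked_ids, relevant_ids, answer, gold_answer,
                 w_rank=0.5, w_ans=0.5):
    """Hypothetical Stage-II reward: list-wise ranking quality + answer match."""
    rank_r = ndcg(ranked_ids, relevant_ids)
    ans_r = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_rank * rank_r + w_ans * ans_r
```

Because both terms feed one scalar reward, the policy is pushed to rank evidence well and answer correctly in the same update, which is the sense in which retrieval and generation are jointly optimized.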
Authors
Shengwei Zhao, Jingwen Yao, Sitong Wei, Linhai Xu, Yuying Liu, Dong Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Zhiqiang Tian
Xi'an Jiaotong University (Computer Vision, Medical Image Analysis, Robotics)
Shaoyi Du
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University (Pattern Recognition, Computer Vision, Image Processing)