🤖 AI Summary
Existing multimodal retrieval-augmented generation (MRAG) methods face three key challenges: static retrieval timing, inflexible modality selection, and inefficient utilization of retrieved information. This paper proposes Windsock, the first adaptive MRAG framework that enables query-driven determination of retrieval necessity and dynamic multimodal selection. To further address data scarcity and noise sensitivity, we introduce DANCE—a dynamic, noise-robust instruction-tuning method—that autonomously transforms standard question-answering data into high-quality MRAG training examples by leveraging the model’s intrinsic knowledge, and incorporates a self-evaluation mechanism to construct robust training sets. Experiments demonstrate that our approach maintains generation quality while improving response accuracy by 17.07%, reducing average retrieval count by 8.95%, and significantly enhancing computational efficiency and noise robustness.
📝 Abstract
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing 8.95% retrieval times.