Uncertainty Quantification for Multimodal Retrieval Augmented Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing uncertainty quantification methods struggle to assess reliability in multimodal Retrieval-Augmented Generation (RAG) systems, where errors can arise at several stages: retrieval, visual understanding, and text generation. This work proposes LeMUQ, a learnable approach that jointly models uncertainty from the retrieval and multimodal components of multimodal RAG. LeMUQ measures how token-level probabilities shift under input perturbations, such as removing a modality or ablating the retrieved context, encodes these retrieval- and modality-aware signals as probability tokens, and processes them with a fine-tuned model to capture cross-modal and retrieval interactions. Experiments across datasets, retrievers, and vision-language models show an average AUROC improvement of 3.8% over baseline and fine-tuned UQ methods. The method generalizes well across retrieval setups and datasets, with mixed results when transferring across vision-language models.
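
To make the idea of probability shifts under input perturbations concrete, here is a minimal Python sketch. It is not the LeMUQ implementation: the function name `prob_shift_features`, the perturbation names `no_image` and `no_retrieval`, and the choice of log-probability differences as features are all assumptions made for illustration. It assumes the per-token probabilities of the same generated answer have already been obtained from a VLM under the full input and under each perturbed input.

```python
import math
from typing import Dict, List


def prob_shift_features(
    full: List[float],
    ablated: Dict[str, List[float]],
) -> List[List[float]]:
    """Build one feature row per answer token (hypothetical feature layout).

    full:    probability of each answer token given the complete input
             (question + image + retrieved context).
    ablated: perturbation name -> probability of the same tokens when part
             of the input is removed (names are assumptions, not from the paper).

    Each row is [log p_full, shift_1, shift_2, ...], where a shift is
    log p_full - log p_ablated, with perturbations in sorted name order.
    """
    names = sorted(ablated)
    for name in names:
        if len(ablated[name]) != len(full):
            raise ValueError(f"length mismatch for perturbation {name!r}")

    rows = []
    for i, p_full in enumerate(full):
        log_full = math.log(p_full)
        row = [log_full]
        for name in names:
            # Large positive shift: the token relied on the removed input.
            row.append(log_full - math.log(ablated[name][i]))
        rows.append(row)
    return rows


# Toy numbers, invented for illustration only.
full_probs = [0.9, 0.8]
perturbed = {
    "no_image": [0.9, 0.2],       # second token depends on the image
    "no_retrieval": [0.45, 0.8],  # first token depends on retrieved text
}
features = prob_shift_features(full_probs, perturbed)
```

In this toy case the first token's probability halves without retrieval and the second token's probability drops fourfold without the image, so the shifts single out which input each token depended on. A learned model could consume rows like these, although how LeMUQ actually encodes them into probability tokens is not specified here.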

📝 Abstract

Retrieval Augmented Generation (RAG) improves the question answering capabilities of Large Language Models (LLMs) by incorporating external knowledge and has recently been extended to multimodal settings through Vision-Language Models (VLMs) that integrate visual and textual information. Despite these advances, generated answers can still be incorrect or misleading. Uncertainty Quantification (UQ) methods aim to estimate the reliability of model outputs, but most existing approaches are designed for text-only models and perform poorly in multimodal RAG scenarios. A key challenge is capturing uncertainty arising from multiple stages of the pipeline, including retrieval, visual understanding, and generation. In this work, we show that modeling uncertainty using multimodal and retrieval-aware probability signals improves uncertainty estimation in multimodal RAG systems. We introduce LeMUQ, a Learnable Multimodal UQ method that analyzes token probabilities under input modifications, such as removing modalities or retrieved context. By encoding these signals as probability tokens and processing them with a fine-tuned model, our approach captures interactions between modalities and retrieval. Experiments across datasets, retrievers, and VLMs show consistent improvements over baseline and fine-tuned UQ methods: LeMUQ increases AUROC by 3.8% on average. Additionally, our method generalizes well across different retrieval setups and datasets, with mixed results when transferring across different VLMs. Our findings highlight the importance of modeling multimodal uncertainty and provide a step toward more reliable and safer multimodal RAG systems. Code is available on GitHub.
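
For readers unfamiliar with the reported metric, the following Python sketch shows how AUROC is commonly computed for a UQ method. It is a generic illustration, not the paper's evaluation code. The labeling convention (1 for an incorrect answer, 0 for a correct one) and the function name `auroc` are assumptions. Under this convention, AUROC is the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """Pairwise AUROC for uncertainty scores.

    scores: uncertainty assigned to each answer (higher = less reliable).
    labels: 1 if the answer was incorrect, 0 if it was correct
            (an assumed convention for this sketch).
    """
    incorrect = [s for s, y in zip(scores, labels) if y == 1]
    correct = [s for s, y in zip(scores, labels) if y == 0]
    if not incorrect or not correct:
        raise ValueError("AUROC needs at least one example of each class")

    wins = 0.0
    for u_bad in incorrect:
        for u_good in correct:
            if u_bad > u_good:
                wins += 1.0   # incorrect answer ranked as more uncertain
            elif u_bad == u_good:
                wins += 0.5   # ties count as half
    return wins / (len(incorrect) * len(correct))


# Toy scores, invented for illustration only.
toy_scores = [0.9, 0.2, 0.3, 0.4]
toy_labels = [1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 3 of 4 pairs ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of correct from incorrect answers. The quadratic pairwise loop is chosen for readability; larger evaluations would normally use a rank-based implementation from a library.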

Problem

Research questions and friction points this paper is trying to address.

Uncertainty Quantification

Multimodal Retrieval Augmented Generation

Vision-Language Models

Reliability Estimation

Multimodal Uncertainty

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty Quantification

Multimodal RAG

Vision-Language Models