Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

📅 2026-05-13
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of existing visual evidence selection methods: they rely on semantic relevance or surface-level similarity, which often fails to align with the evidence's actual utility for downstream reasoning. From an information-theoretic perspective, the authors define evidence utility as the information gain induced on the model's output distribution and, because answer-space optimization is intractable, rank evidence by the gain on a latent helpfulness variable instead. The key contributions are a theoretical result that, under mild assumptions, latent-space ranking is equivalent to ranking by answer-space utility, and a training-free, surrogate-accelerated framework that estimates utility with lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families show that the method consistently outperforms state-of-the-art RAG baselines while substantially reducing computational cost.
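One common way to write down the quantities described above is as entropy reductions. The notation here is an assumption for illustration and is not taken from the paper: $q$ is the query, $e$ a candidate piece of visual evidence, $Y$ the model's answer, and $Z$ the latent helpfulness variable.

```latex
% Answer-space utility: reduction in uncertainty about the answer Y
U(e) = H(Y \mid q) - H(Y \mid q, e)

% Latent-space gain: reduction in uncertainty about the helpfulness variable Z
G_Z(e) = H(Z \mid q) - H(Z \mid q, e)

% Ranking equivalence claimed under mild assumptions, for any two candidates e_i, e_j
U(e_i) \ge U(e_j) \iff G_Z(e_i) \ge G_Z(e_j)
```

Under this reading, only the ordering of candidates needs to be preserved, which is why a cheaper latent-space estimate can stand in for the answer-space quantity.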
πŸ“ Abstract
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
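The idea of scoring evidence by the information gain it induces on an output distribution can be illustrated with a minimal sketch. It assumes information gain is measured as entropy reduction over a small set of candidate answers; the function names, evidence ids, and probabilities are hypothetical, and this is not the authors' implementation (which, per the abstract, estimates utility on a latent variable using lightweight surrogate models).

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def information_gain(prior, posterior):
    """Entropy reduction from the no-evidence distribution to the with-evidence one.

    Assumption: this is one common definition of information gain; the paper's
    exact utility definition is not reproduced here.
    """
    return entropy(prior) - entropy(posterior)

def rank_evidence(prior, posteriors, top_k):
    """Return the top_k evidence ids, ordered by information gain (highest first)."""
    gains = {eid: information_gain(prior, post) for eid, post in posteriors.items()}
    return sorted(gains, key=gains.get, reverse=True)[:top_k]

# Hypothetical answer distributions over two candidate answers.
prior = [0.5, 0.5]              # model output without any visual evidence
posteriors = {                  # model output after conditioning on each image
    "img_a": [0.9, 0.1],
    "img_b": [0.55, 0.45],
    "img_c": [0.7, 0.3],
}

selected = rank_evidence(prior, posteriors, top_k=2)
print(selected)
```

Note that entropy reduction only measures how much an image sharpens the output distribution, not whether the sharpened answer is correct, so a practical selector would likely need additional safeguards.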
Problem

Research questions and friction points this paper is trying to address.

visual evidence selection
multimodal retrieval-augmented generation
evidence utility
information gain
downstream reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

utility-oriented evidence selection
multimodal retrieval-augmented generation
information-theoretic utility
latent helpfulness
training-free framework