🤖 AI Summary
This work addresses a key limitation of existing general-purpose multimodal retrieval methods: they rely on static visual encodings and often fail to reason reliably under visually ambiguous conditions, leading to speculative errors. To overcome this, we propose the first evidence-driven, agent-based multimodal retrieval framework, which reframes retrieval as an interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. By dynamically invoking external vision tools to acquire fine-grained evidence, and by combining curriculum learning, supervised reasoning activation, rejection-based refinement, and evidence-alignment-guided reinforcement learning, the approach substantially improves both reasoning reliability and generalization. Empirical evaluations demonstrate an average 23.0% improvement in retrieval accuracy across multiple benchmarks.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (a 23.0% average improvement), perception-driven reasoning reliability, and generalization.
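The interleaved hypothesis-then-verify loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: all names (`generate_hypothesis`, `visual_verify`, `rerank`), the confidence-threshold trigger, and the toy lexical scoring are assumptions standing in for the MLLM and its external vision tools.

```python
# Hypothetical sketch of an interleaved retrieval-reasoning loop:
# generate a hypothesis per candidate; only when the hypothesis is
# low-confidence (visually ambiguous) invoke an external tool to
# gather extra evidence, then rerank. All names are illustrative.

from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    score: float = 0.0

def generate_hypothesis(query: str, candidate: Candidate) -> float:
    # Stand-in for the MLLM proposing a match hypothesis with a
    # confidence; here, toy lexical overlap between query words and
    # tokens of the candidate id.
    overlap = len(set(query.split()) & set(candidate.doc_id.split("_")))
    return overlap / max(len(query.split()), 1)

def visual_verify(candidate: Candidate, tool_evidence: dict) -> float:
    # Stand-in for calling an external vision tool (e.g. crop, zoom,
    # OCR) to acquire fine-grained evidence; returns an evidence bonus.
    return tool_evidence.get(candidate.doc_id, 0.0)

def rerank(query, candidates, tool_evidence, conf_threshold=0.5):
    """Alternate hypothesis generation with targeted verification:
    only ambiguous (low-confidence) candidates trigger a tool call,
    mirroring selective evidence acquisition."""
    for c in candidates:
        conf = generate_hypothesis(query, c)
        if conf < conf_threshold:  # ambiguous -> verify visually
            conf += visual_verify(c, tool_evidence)
        c.score = conf
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```

A confident lexical match skips verification entirely, while an ambiguous candidate's final score depends on the evidence the tool returns, which is the selectivity the framework relies on to avoid speculative reasoning.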