🤖 AI Summary
This work addresses the susceptibility of re-rankers in multi-modal retrieval-augmented generation to visual distractors (such as irrelevant image backgrounds) that bias relevance scoring. To mitigate this, the authors propose Region-R1, a framework that formulates query-side region selection as a reinforcement learning decision problem, dynamically choosing between preserving the full image and focusing on a question-relevant region to sharpen evidence relevance. Central to the approach is region-aware group relative policy optimization (r-GRPO), which trains the policy to select discriminative regions. Evaluated on the E-VQA and InfoSeek benchmarks, Region-R1 achieves state-of-the-art performance, improving conditional Recall@1 by up to 20% over existing methods.
📝 Abstract
Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically encode the full query image as a single global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region-cropping framework that formulates region selection as a decision-making problem during re-ranking: the system learns either to retain the full image or to focus on a question-relevant region before scoring the retrieved candidates. Region-R1 learns this policy with a novel region-aware group relative policy optimization (r-GRPO) algorithm that dynamically crops a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performance and increasing conditional Recall@1 by up to 20%. These results highlight query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.
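The abstract does not spell out r-GRPO, but the "group relative" family of policy-optimization methods it builds on shares a common core: sample a group of actions for the same query, score each with a reward, and normalize rewards within the group to get advantages. As a minimal, hedged sketch of that idea applied here (the action set, reward definition, and all names are illustrative assumptions, not the authors' implementation):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward by the mean and
    standard deviation of its own sampled group, so actions are judged
    relative to their siblings rather than on an absolute scale."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 4 region actions sampled for one image-question
# query: keep the full image, or crop one of several candidate regions.
# Reward here is an assumed binary signal, e.g. 1.0 if the re-ranker
# placed the gold document at rank 1 after conditioning on that region.
actions = ["full_image", "crop_a", "crop_b", "crop_c"]
rewards = [0.0, 1.0, 0.0, 1.0]

advantages = group_relative_advantages(rewards)
for action, adv in zip(actions, advantages):
    print(f"{action}: advantage {adv:+.2f}")
```

Crops that outperform their group (here `crop_a` and `crop_c`) get positive advantages and are reinforced; keeping the full image is reinforced only when cropping does not help, which matches the paper's framing of learning when to retain the full image versus focus on a region.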