High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Redundant visual tokens in high-resolution images degrade the inference efficiency and visual grounding accuracy of large multi-modal models (LMMs). Method: an end-to-end reinforcement learning framework that requires no coordinate-level grounding annotations. To mitigate the policy cold start, it employs a multi-turn dialogue template; to optimize grounding, it introduces binary-reward-driven Multi-turn Grounding-based Policy Optimization (MGPO), restricting the policy loss to model outputs across dialogue rounds and enabling adaptive image cropping that focuses on salient regions. Contribution/Results: the first RL-based visual grounding method trained exclusively on coordinate-free data. Experiments show +5.4% and +5.2% absolute gains over GRPO on in-distribution MME-Realworld and out-of-distribution (OOD) V* Bench, respectively; MGPO-trained Qwen2.5-VL-7B surpasses OpenAI o1 and GPT-4o on V* Bench, demonstrating markedly improved in-distribution and out-of-distribution visual understanding.

📝 Abstract
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-answering data with short answers and no grounding annotations, MGPO elicits stronger grounding capabilities than GRPO, yielding a 5.4% improvement on in-distribution MME-Realworld and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
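The rollout described in the abstract (turn 1: the model predicts a grounding box, which is clamped and used to crop a sub-image; turn 2: the model answers, scored by a binary reward) can be sketched as below. This is a minimal toy illustration, not the authors' code: the image is a plain 2D list of intensities, and `clamp_box`, `crop`, and `binary_reward` are assumed helper names.

```python
def clamp_box(box, width, height):
    """Clamp a model-predicted (x1, y1, x2, y2) box to the image bounds."""
    x1, y1, x2, y2 = box
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    # Reorder so the box is always well-formed (x1 <= x2, y1 <= y2).
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def crop(image, box):
    """Crop a sub-image (list of pixel rows) using a clamped box."""
    h, w = len(image), len(image[0])
    x1, y1, x2, y2 = clamp_box(box, w, h)
    return [row[x1:x2] for row in image[y1:y2]]

def binary_reward(predicted_answer, gold_answer):
    """1.0 iff the final short answer matches the reference, else 0.0 --
    the only supervision signal MGPO needs (no coordinate labels)."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0

# Toy 4x6 "image"; turn 1 crops to a predicted region, turn 2 scores the answer.
image = [[c + 6 * r for c in range(6)] for r in range(4)]
sub = crop(image, (2, 1, 5, 3))       # appended to the next dialogue round
reward = binary_reward("cat", " Cat ")  # -> 1.0
```

The key point the sketch makes concrete is that the reward never inspects the predicted box itself; grounding quality is shaped only indirectly, through whether the cropped view lets the model answer correctly.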
Problem

Research questions and friction points this paper is trying to address.

Improve high-resolution image processing in large multi-modal models
Enable automatic visual grounding without costly annotations
Address the cold-start problem in the model's ability to autonomously trigger grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn RL framework for visual grounding
Automatic sub-image cropping via grounding coordinates
Binary reward function for grounding ability
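The abstract's cold-start fix restricts policy loss computation to tokens the model itself generated across the dialogue rounds, excluding template, user, and injected sub-image tokens. A toy version of that masking, assuming per-token role labels (a representation chosen here for illustration, not taken from the paper):

```python
def policy_loss_mask(token_roles):
    """Return a per-token weight: 1.0 for model-generated tokens
    (grounding boxes and answers in every round), 0.0 for template,
    user, and image tokens, which are excluded from the policy loss."""
    return [1.0 if role == "assistant" else 0.0 for role in token_roles]

roles = ["system", "user", "assistant", "user", "assistant"]
mask = policy_loss_mask(roles)  # -> [0.0, 0.0, 1.0, 0.0, 1.0]
```

Multiplying per-token log-probability terms by such a mask keeps the optimization focused on the policy's own outputs, which the paper reports stabilizes multi-turn training.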