Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

๐Ÿ“… 2026-02-27
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
High-resolution visual inputs often contain substantial redundancy, yet existing approaches rely on costly human annotations to identify critical regions. To address this limitation, this work proposes HART, a framework that achieves annotation-free high-resolution visual reasoning for the first time. HART employs a reinforcement-learning-based closed-loop mechanism that enables large vision-language models to focus on and self-verify relevant image regions without external supervision. It introduces Advantage Preference Group Relative Policy Optimization (AP-GRPO), an algorithm that efficiently strengthens key-region localization within a post-training paradigm. Beyond yielding interpretable reasoning pathways, HART significantly outperforms strong baselines across multiple high-resolution benchmarks; notably, a HART-finetuned Qwen2.5-VL-7B surpasses much larger models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B.

๐Ÿ“ Abstract
Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.
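The abstract describes AP-GRPO as a GRPO-style objective that rewards accurate localization of key regions without grounding labels. The paper's exact reward and preference terms are not given here, so the sketch below only illustrates the general GRPO idea it builds on: sample a group of rollouts, score each, and normalize rewards within the group to obtain advantages. The `region_reward` composite (correctness plus an IoU bonus) is purely a hypothetical stand-in for whatever self-verification signal HART's closed loop actually uses.

```python
# Hedged sketch of group-relative advantage computation (the GRPO backbone
# AP-GRPO extends). All reward details below are illustrative assumptions,
# not the paper's actual objective.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std,
    as in Group Relative Policy Optimization (GRPO)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

def region_reward(answer_correct, region_iou, iou_weight=0.5):
    """Hypothetical composite reward: task correctness plus a localization
    bonus for the proposed key region. The IoU term stands in for the
    self-verification signal; it is our assumption, not the paper's."""
    return float(answer_correct) + iou_weight * region_iou

# Example: a group of 4 rollouts, each proposing a key region and an answer.
rollouts = [
    {"correct": True,  "iou": 0.8},
    {"correct": True,  "iou": 0.3},
    {"correct": False, "iou": 0.6},
    {"correct": False, "iou": 0.1},
]
rewards = [region_reward(r["correct"], r["iou"]) for r in rollouts]
advantages = group_relative_advantages(rewards)
# Rollouts above the group mean receive positive advantages, so policy
# updates reinforce both correct answers and better-localized regions
# without any human grounding annotations.
```

Under this group-relative scheme the policy never needs per-image box labels: the ranking signal comes entirely from comparing sampled rollouts against each other, which is what makes the annotation-free framing plausible.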
Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models
High-Resolution Visual Reasoning
Annotation-Free
Visual Grounding
Token Redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-Free
High-Resolution Visual Reasoning
Reinforcement Learning
Large Multimodal Models
AP-GRPO
๐Ÿ”Ž Similar Papers
No similar papers found.