Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited visual parsing ability of multimodal large language models (MLLMs) in complex scenes with high object density and cluttered backgrounds. The authors propose Mags-RL, an agentic reinforcement learning framework in which the model localizes regions of interest on its own, without extra annotations, and calls an external super-resolution “magnifying glass” agent. Inference runs in two rounds: the first produces an initial rationale and the regions of interest; the second crops and upscales those regions, then revisits and verifies the earlier reasoning to produce the final answer. A curriculum learning strategy makes RL training data-efficient, with as few as 40 training samples reported as sufficient for reasonable performance. The abstract reports better performance than recent competing methods on VSR, TallyQA, and GQA subsets, with reasoning that is precisely grounded in the image.

📝 Abstract

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show superior performance over recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.
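
The two-round procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code, which has not been released: the names `crop`, `upscale_nearest`, `two_round_answer`, `first_round`, and `second_round` are hypothetical, the image is a toy grid of integers, and nearest-neighbor pixel repetition is only a stand-in for a real super-resolution model.

```python
from typing import Any, Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def upscale_nearest(image: Image, factor: int) -> Image:
    """Stand-in for super-resolution: repeat every pixel factor x factor times."""
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def two_round_answer(
    image: Image,
    question: str,
    first_round: Callable[[Image, str], Tuple[str, List[Box]]],
    second_round: Callable[[Image, str, str, List[Image]], Any],
    factor: int = 2,
) -> Any:
    """Round 1 drafts a rationale and proposes regions; round 2 re-checks it on zoomed crops."""
    draft, boxes = first_round(image, question)
    zoomed = [upscale_nearest(crop(image, box), factor) for box in boxes]
    return second_round(image, question, draft, zoomed)


# Toy callables standing in for the MLLM (hypothetical behavior, for illustration only).
def toy_first_round(image: Image, question: str) -> Tuple[str, List[Box]]:
    return "draft rationale", [(1, 1, 3, 3)]


def toy_second_round(image: Image, question: str, draft: str, zoomed: List[Image]) -> Any:
    return {"draft": draft, "num_regions": len(zoomed), "first_region": zoomed[0]}


toy_image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
result = two_round_answer(toy_image, "How many objects?", toy_first_round, toy_second_round)
```

In this sketch the model calls are passed in as plain functions so that the control flow (draft, select regions, magnify, verify) is visible on its own; how Mags-RL actually formats regions, rewards, or prompts is not specified in the abstract.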

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

complex scene reasoning

visual grounding

fine-grained image understanding

image interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Reinforcement Learning

Multimodal Large Language Models

Super-resolution