🤖 AI Summary
This work addresses the difficulty multimodal large language models have with fine-grained visual perception: global image context interferes with their ability to focus on small but critical regions. To overcome this, the authors propose a region-to-image knowledge distillation approach that internalizes the dynamic zooming capability of “thinking with images” into the training paradigm. Specifically, a strong teacher model generates high-quality VQA labels from tightly cropped regions, and these labels are then distilled into a student model that processes the full image in a single forward pass. The student thereby gains accurate fine-grained perception without the computational overhead of iterative zooming during inference. The study also introduces ZoomBench, a new evaluation benchmark, along with a dual-view assessment protocol. The proposed method achieves leading performance across multiple fine-grained perception benchmarks and also improves general multimodal capabilities, including visual reasoning and GUI agent interaction.
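The data-construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `VQASample` container, the `crop` helper, the box format, and `toy_teacher` are hypothetical names introduced here, and a nested list of integers stands in for real pixel data. The sketch shows only the core idea that the teacher labels a crop while the stored training sample pairs that label with the full image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy stand-in for pixel data (rows of values)
Box = Tuple[int, int, int, int]  # hypothetical format: (left, top, right, bottom), exclusive ends


@dataclass
class VQASample:
    image: Image    # what the student sees at training time
    question: str
    answer: str


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def build_region_to_image_samples(
    image: Image,
    boxes: List[Box],
    teacher: Callable[[Image], Tuple[str, str]],
) -> List[VQASample]:
    """Label each crop with the teacher, then attach the label to the full image."""
    samples = []
    for box in boxes:
        region = crop(image, box)
        question, answer = teacher(region)  # teacher sees only the zoomed-in crop
        # The student is trained on the full image, not the crop.
        samples.append(VQASample(image=image, question=question, answer=answer))
    return samples


def toy_teacher(region: Image) -> Tuple[str, str]:
    """Placeholder for a strong teacher model (assumption, for illustration only)."""
    largest = max(value for row in region for value in row)
    return "What is the largest value in the highlighted area?", str(largest)
```

In a real pipeline the teacher would be a strong multimodal model, and the question would need to remain unambiguous when posed against the full image; how that is ensured is not specified in this summary.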
📝 Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
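As a rough illustration of the dual-view idea, the sketch below scores the same questions twice, once from predictions made on the full image and once from predictions made on the cropped region, and reports the difference. The abstract does not give the exact metric, so defining the gap as regional accuracy minus global accuracy, the exact-match scoring, and the function names are all assumptions made here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def zooming_gap(
    global_preds: Sequence[str],
    regional_preds: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Assumed definition: accuracy on the cropped view minus accuracy on the full image."""
    return accuracy(regional_preds, answers) - accuracy(global_preds, answers)
```

Under this assumed definition, a gap near zero would suggest the model already extracts from the full image what it could see in the crop, while a large positive gap would indicate that zooming still helps.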