🤖 AI Summary
This work addresses the inefficiency of multimodal large language models in processing high-resolution images, where global resolution scaling introduces many redundant visual tokens and ignores both spatial sparsity and query intent. The authors propose Q-Zoom, a query-aware adaptive high-resolution framework in which a lightweight dynamic gating network decides whether high-resolution processing is needed at all. For queries that do need it, a self-distilled region proposal network (SD-RPN) localizes the task-relevant region, enabling coarse-to-fine fusion of local detail with the global layout. The gating network is trained on consistency-aware deterministic routing labels, and the SD-RPN is trained with a fully self-supervised distillation paradigm. Evaluated on Qwen2.5-VL-7B, Q-Zoom accelerates inference by 2.52× on document and OCR benchmarks and 4.39× in high-resolution scenarios while matching the baseline's peak accuracy; when configured for maximum perceptual fidelity, it exceeds the baseline's peak performance by 1.1% and 8.1% on those respective benchmarks. The gains also carry over to Qwen3-VL and LLaVA architectures.
📝 Abstract
Multimodal large language models (MLLMs) require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. The project page is available at https://yuhengsss.github.io/Q-Zoom/.
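The coarse-to-fine control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the Dynamic Gating Network and the SD-RPN are learned modules, and here they are replaced by two placeholder inputs, a scalar `gate_score` and a relevance `heatmap` over the coarse token grid. All function names, both thresholds, and the `zoom` factor are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# (row0, col0, row1, col1), end-exclusive, in coarse grid cells
Box = Tuple[int, int, int, int]


def needs_zoom(gate_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for the gate: zoom only above a score threshold."""
    return gate_score > threshold


def roi_from_heatmap(heatmap: List[List[float]], threshold: float = 0.5) -> Optional[Box]:
    """Hypothetical stand-in for region proposal: the tight bounding box
    around all cells whose relevance exceeds the threshold."""
    hits = [(r, c) for r, row in enumerate(heatmap)
            for c, v in enumerate(row) if v > threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def visual_token_count(grid: Tuple[int, int], roi: Optional[Box], zoom: int = 2) -> int:
    """Coarse global tokens plus the RoI re-encoded at `zoom`x resolution per side."""
    coarse = grid[0] * grid[1]
    if roi is None:
        return coarse
    r0, c0, r1, c1 = roi
    return coarse + (r1 - r0) * zoom * (c1 - c0) * zoom


def route(gate_score: float, heatmap: List[List[float]], zoom: int = 2):
    """Bypass the high-resolution path when the gate says coarse features
    suffice; otherwise add tokens only for the proposed region."""
    grid = (len(heatmap), len(heatmap[0]))
    roi = roi_from_heatmap(heatmap) if needs_zoom(gate_score) else None
    return roi, visual_token_count(grid, roi, zoom)


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
    [0.1, 0.0, 0.8, 0.7],
    [0.0, 0.0, 0.1, 0.0],
]
print(route(0.9, heatmap))  # zoom path: ((1, 2, 3, 4), 32)
print(route(0.1, heatmap))  # bypass path: (None, 16)
```

In this toy 4×4 grid, upscaling the whole image by 2× per side would yield 64 visual tokens, whereas the zoom path yields 32 and the bypass path 16. The numbers only illustrate why restricting high resolution to a query-relevant region reduces the token load on self-attention; they say nothing about the speedups reported in the paper.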