🤖 AI Summary
Multimodal large language models (MLLMs) struggle with fine-grained perception: processing full high-resolution images is computationally expensive, training-based Region-of-Interest (RoI) methods depend on large-scale annotated data, and training-free methods that read the model's own attention are slow and less accurate. To address these issues, the paper proposes the Self-Distilled Region Proposal Network (SD-RPN), an RoI localization framework that requires no human annotations. SD-RPN denoises attention maps from the MLLM's middle layers to generate pseudo-RoI labels, then uses those labels to train a lightweight Region Proposal Network (RPN). The trained RPN predicts the RoI in a single forward pass over middle-layer features, which decouples region identification from auto-regressive text generation. Integrated into LLaVA-1.5 and trained on only 10K question-answer pairs, SD-RPN achieves an absolute accuracy gain of more than 10 percentage points on unseen benchmarks including TextVQA, DocVQA, and V-Star, indicating strong data efficiency and generalization.
📝 Abstract
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. Recent methods use a Region-of-Interest (RoI) mechanism to focus on salient areas, but they present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that use the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns more precise localization. The RPN is also highly efficient: it predicts the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a small number of question-answer pairs (e.g., 10K), our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
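
The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pseudo_roi_labels` and `masked_bce`, the min-max normalization, and the two thresholds are all assumptions chosen for illustration. The sketch marks high-attention visual tokens as foreground, low-attention tokens as background, and leaves the ambiguous middle band unlabeled so that it does not contribute to the training loss of the region proposal head.

```python
import numpy as np

def pseudo_roi_labels(attn, fg_thr=0.6, bg_thr=0.2):
    """Turn a 2D attention map into tri-state pseudo-labels.

    Returns an int array: 1 = foreground, 0 = background, -1 = ignored.
    The thresholds are hypothetical values, not taken from the paper.
    """
    attn = np.asarray(attn, dtype=np.float64)
    lo, hi = attn.min(), attn.max()
    norm = (attn - lo) / (hi - lo + 1e-12)  # min-max normalize to [0, 1]
    labels = np.full(norm.shape, -1, dtype=np.int64)  # ambiguous by default
    labels[norm <= bg_thr] = 0  # confidently low attention
    labels[norm >= fg_thr] = 1  # confidently high attention
    return labels

def masked_bce(logits, labels):
    """Binary cross-entropy on logits, averaged over labeled tokens only."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = labels >= 0  # drop ignored (ambiguous) tokens
    if not mask.any():
        return 0.0
    z = logits[mask]
    y = labels[mask].astype(np.float64)
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    loss = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())

attn = np.array([[0.0, 0.1], [0.5, 1.0]])
labels = pseudo_roi_labels(attn)
loss = masked_bce(np.zeros((2, 2)), labels)
```

In this toy example the token with normalized attention 0.5 falls between the two thresholds and is excluded from the loss, which is one simple way to keep uncertain regions from supervising the lightweight predictor.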