🤖 AI Summary
In fine-grained visual question answering (VQA), multimodal large language models (MLLMs) struggle to precisely localize small, critical image regions; existing visual cropping methods rely on task-specific fine-tuning or exhaustive search, or break compatibility with efficient attention implementations. This paper proposes FOCUS: a training-free, plug-and-play zero-shot visual cropping method. It leverages the KV cache from an MLLM forward pass to generate entity-relevance heatmaps, then combines prompt-guided target identification, region proposal, and ranking for efficient local cropping and re-inference. FOCUS establishes the first zero-shot cropping paradigm grounded in internal MLLM representations, eliminating fine-tuning and brute-force search while fully preserving standard attention mechanisms. Evaluated on four fine-grained VQA benchmarks across two MLLM architectures, FOCUS matches the accuracy of the state-of-the-art baseline ZoomEye while accelerating inference by 3–6.5×.
📝 Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: they require task-specific fine-tuning, suffer low efficiency due to uninformed exhaustive search, or are incompatible with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3–6.5× less compute.
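The middle two steps of the pipeline above can be sketched in miniature. The toy below is an assumption-laden illustration, not the paper's implementation: it stands in for the KV-cache relevance computation with a simple dot product between a target-token query vector and per-patch key vectors, then proposes and ranks fixed-size windows over the resulting map. The function names (`relevance_map_from_kv`, `propose_and_rank`) and the window-scoring scheme are hypothetical.

```python
def relevance_map_from_kv(keys, query):
    # Step 2 (toy stand-in): relevance of each image patch as the dot
    # product between its cached key vector and a target-token query.
    # keys: 2-D grid of patch key vectors; query: one vector.
    return [[sum(q * k for q, k in zip(query, patch)) for patch in row]
            for row in keys]

def propose_and_rank(rel_map, win=2):
    # Step 3 (toy stand-in): score every win x win window by its summed
    # relevance and return regions sorted best-first.
    # Each entry: ((row, col, height, width), score).
    h, w = len(rel_map), len(rel_map[0])
    regions = []
    for r in range(h - win + 1):
        for c in range(w - win + 1):
            score = sum(rel_map[i][j]
                        for i in range(r, r + win)
                        for j in range(c, c + win))
            regions.append(((r, c, win, win), score))
    return sorted(regions, key=lambda reg: reg[1], reverse=True)

# Usage on a 4x4 relevance map with a hot 2x2 center:
rel_map = [[0, 0, 0, 0],
           [0, 5, 5, 0],
           [0, 5, 5, 0],
           [0, 0, 0, 0]]
best_region, best_score = propose_and_rank(rel_map)[0]
# best_region == (1, 1, 2, 2); the crop fed back for re-inference (step 4).
```

In the actual method, the relevance map is read out of the MLLM's own KV cache during a standard forward pass, which is what makes the search informed rather than exhaustive.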