🤖 AI Summary
This work addresses the inefficiency and inaccurate localization commonly observed when multimodal large language models perform training-free visual reasoning, problems that stem from perceptual redundancy and from misalignment between semantic intent and spatial attention. To overcome these limitations, the authors propose AdaFocus, a training-free, adaptive visual reasoning framework that dynamically determines whether and where to crop an image through a two-stage process. The core innovation lies in a confidence-driven cropping decision mechanism coupled with a semantics-guided spatial localization strategy, which jointly mitigate redundant computation and attention misalignment without requiring any additional training. Experimental results show that AdaFocus achieves approximately 4.0× faster inference than the state-of-the-art method ZoomEyes while significantly improving accuracy, all without sacrificing the training-free advantage.
📝 Abstract
Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantics-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving approximately a 4.0× inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
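The two-stage pipeline can be pictured as a confidence gate followed by a semantic localizer. The sketch below is purely illustrative: the function names, the fixed threshold `tau`, and the placeholder scoring/localization logic are assumptions for exposition, not the authors' implementation (which would query the MLLM for answer confidence and use text-to-region attention for localization).

```python
# Hypothetical sketch of a two-stage "when/where to crop" pipeline in the
# spirit of AdaFocus. All names, values, and heuristics are illustrative
# assumptions, not the paper's actual method.

def answer_confidence(image, question):
    """Stage 1 stand-in: the model's confidence in answering directly from
    the full image (e.g., derived from answer-token probabilities)."""
    return 0.42  # placeholder score for demonstration


def localize_region(image, question):
    """Stage 2 stand-in: pick the crop box whose content best matches the
    question's semantics (e.g., via semantics-guided spatial attention)."""
    w, h = image["width"], image["height"]
    return (w // 4, h // 4, w // 2, h // 2)  # placeholder: centered box


def adafocus_pipeline(image, question, tau=0.7):
    """Decide *whether* to crop (confidence gate); if the model is not
    confident enough, decide *where* to crop (semantic localization)."""
    if answer_confidence(image, question) >= tau:
        return {"crop": None}  # confident enough: skip cropping entirely
    box = localize_region(image, question)
    return {"crop": box}  # zoom into the question-relevant region


result = adafocus_pipeline({"width": 800, "height": 600},
                           "What color is the sign?")
```

The confidence gate is what avoids the perceptual redundancy of indiscriminate cropping, and the semantic localizer is what corrects the drift between intent and spatial attention.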