CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

📅 2026-05-22

🤖 AI Summary
Existing visual search methods for high-resolution images struggle to balance coverage and computational efficiency, often resulting in perceptual blind spots or semantic fragmentation. This work proposes CVSearch, a training-free, adaptive framework that dynamically orchestrates expert-assisted proposals and semantic-aware scanning through an “evaluate-and-re-search” mechanism. Its key innovations include semantic-guided adaptive image tiling, a bottom-up search strategy driven by visual complexity priors, and a cognition-inspired scheduling policy. Evaluated on multiple high-resolution benchmarks, CVSearch achieves state-of-the-art accuracy while significantly improving search efficiency.
๐Ÿ“ Abstract
High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.
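The staged scheduling described in the abstract can be pictured with a small control-flow sketch. This is only an illustration under stated assumptions, not the released implementation: the function name `assess_then_search`, the `Answer.confident` flag, the box format, and all callables passed in are hypothetical stand-ins for the MLLM, the visual expert, the semantic patching step, and the visual complexity prior. Visiting patches in descending complexity order is likewise an assumption about how such a prior might be used.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Answer:
    text: str
    confident: bool  # hypothetical "information is sufficient" signal


def assess_then_search(
    answer_global: Callable[[], Answer],
    expert_propose: Callable[[], List[Box]],
    answer_on_region: Callable[[Box], Answer],
    semantic_patches: Callable[[], List[Box]],
    complexity: Callable[[Box], float],
    max_regions: int = 8,
) -> Tuple[Answer, str]:
    """Return an answer and the stage that produced it."""
    # Assess: try the global view first and stop if it is sufficient.
    global_answer = answer_global()
    if global_answer.confident:
        return global_answer, "global"

    # Expert-assisted search: cheap, but may miss the target entirely.
    for box in expert_propose():
        region_answer = answer_on_region(box)
        if region_answer.confident:
            return region_answer, "expert"

    # Scan only after the expert stage fails. Patches are assumed to come
    # from a semantic partition rather than a fixed grid, and are visited
    # in descending order of the (assumed) complexity score.
    heap = [(-complexity(b), i, b) for i, b in enumerate(semantic_patches())]
    heapq.heapify(heap)
    visited = 0
    while heap and visited < max_regions:
        _, _, box = heapq.heappop(heap)
        visited += 1
        region_answer = answer_on_region(box)
        if region_answer.confident:
            return region_answer, "scan"

    # Nothing was sufficient: fall back to the global answer.
    return global_answer, "fallback"
```

In this sketch the costly scan is reached only when both the global view and the expert proposals fail, which mirrors the coverage versus efficiency trade-off the abstract describes; the `max_regions` budget is an added assumption to bound the number of model calls.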
Problem

Research questions and friction points this paper is trying to address.

high-resolution image perception
multimodal large language models
visual search
coverage-efficiency trade-off
semantic fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive Visual Search
Semantic Guided Adaptive Patching
Dynamic Bottom-Up Search
Multimodal LLMs
High-Resolution Image Perception