🤖 AI Summary
This work addresses the limitations of large vision-language models on long-tail or dynamically evolving visual knowledge queries, which stem from their reliance on static parametric knowledge and from existing retrieval methods that introduce visual redundancy and lack sufficient reasoning depth. To overcome these challenges, the authors propose the Glance-or-Gaze framework, whose Selective Gaze mechanism performs active visual planning: it dynamically chooses between a global overview (Glance) and localized focus on high-value regions (Gaze), filtering irrelevant information before retrieval. The framework adopts a two-stage training strategy: supervised fine-tuning first aligns the model with the Glance-or-Gaze behavior, and complexity-aware reinforcement learning then optimizes its decision-making. This shift from passive perception to active visual search achieves state-of-the-art performance across six benchmarks, with ablation studies confirming the efficacy of both core components.
📝 Abstract
Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and they lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at the global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models soon to support further exploration.
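The glance-versus-gaze loop described above can be pictured as a simple control loop: at each step the model either retrieves over the whole image (glance), crops a high-value region and retrieves over that (gaze), or answers from the accumulated evidence. The sketch below is purely illustrative; all names (`plan`, `retrieve`, `answer`, `Action`) are hypothetical stand-ins, not the authors' actual API or training pipeline.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "glance", "gaze", or "answer"
    bbox: tuple = None   # (x0, y0, x1, y1) region of interest for a gaze

def glance_or_gaze(query, image, plan, retrieve, answer, max_steps=4):
    """Iteratively alternate global glances and selective gazes until the
    planner decides enough evidence has been gathered to answer."""
    context = []
    for _ in range(max_steps):
        # The planner chooses the next action from the query and evidence so far.
        action = plan(query, image, context)
        if action.kind == "answer":
            break
        if action.kind == "glance":
            # Glance: retrieve over the entire image (global context).
            context.append(retrieve(image))
        else:
            # Gaze: crop a high-value region first, filtering irrelevant
            # content before retrieval (the Selective Gaze idea).
            x0, y0, x1, y1 = action.bbox
            region = [row[x0:x1] for row in image[y0:y1]]
            context.append(retrieve(region))
    return answer(query, context)
```

In this toy form, `image` is a 2-D list and `retrieve` is any callable; in the paper these would be the actual visual search and retrieval components, with the planner trained via the two-stage SFT-then-RL strategy.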