Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning

📅 2026-01-20
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of large vision-language models in handling long-tail or dynamically evolving visual knowledge queries, which stem from their reliance on static parametric knowledge and the inefficacy of existing retrieval methods due to redundancy and insufficient reasoning depth. To overcome these challenges, the authors propose the Glance-or-Gaze framework, which employs an active visual planning mechanism to dynamically choose between a global overview (Glance) and localized focus (Gaze), augmented by Selective Gaze for adaptive visual attention. The framework adopts a two-stage training strategy: first aligning behaviors via supervised fine-tuning, then optimizing decision-making through complexity-aware reinforcement learning. This paradigm shift from passive perception to active search achieves state-of-the-art performance across six benchmarks, with ablation studies confirming the efficacy of its core components.

📝 Abstract
Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.
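To make the Selective Gaze idea concrete, here is a minimal, purely illustrative sketch of a glance-or-gaze control loop. It is not the authors' implementation: the complexity estimator, the bounding box, and all function names (`estimate_complexity`, `select_view`, etc.) are hypothetical stand-ins for the learned components described in the paper.

```python
# Illustrative glance-or-gaze control loop (hypothetical, not the paper's code).
# In GoG these decisions are made by a trained LMM; here a word-count proxy
# stands in for the learned complexity signal.

def estimate_complexity(query: str) -> float:
    """Toy proxy: longer, multi-entity queries count as more complex."""
    return min(1.0, len(query.split()) / 20)

def glance(image):
    """Glance: keep the full image as global context."""
    return image

def gaze(image, box):
    """Gaze: crop a high-value region (x0, y0, x1, y1) before retrieval."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def select_view(image, query, box, threshold=0.5):
    """Selective Gaze: simple queries get a glance; complex ones a focused gaze."""
    if estimate_complexity(query) < threshold:
        return "glance", glance(image)
    return "gaze", gaze(image, box)

# Example on an 8x8 toy "image" (nested lists standing in for pixels).
img = [[p for p in range(8)] for _ in range(8)]
mode, view = select_view(img, "what is this?", box=(2, 2, 6, 6))
print(mode, len(view), len(view[0]))   # glance 8 8
mode, view = select_view(
    img,
    "which year did the architect of this building win the award shown here?",
    box=(2, 2, 6, 6),
)
print(mode, len(view), len(view[0]))   # gaze 4 4
```

The threshold plays the role that complexity-adaptive RL learns end-to-end in the paper: cheap global context when it suffices, focused cropping when the query demands localized evidence.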
Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models
knowledge-intensive queries
visual redundancy
complex visual queries
search-augmented approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Gaze
Reinforcement Learning
Search-Augmented LMMs
Visual Planning
Complexity-Adaptive Training
👥 Authors
Hongbo Bai
Hong Kong University of Science and Technology
Yujin Zhou
Hong Kong University of Science and Technology
Yile Wu
Hong Kong University of Science and Technology
Chi-Min Chan
HKUST
Large Language Models, Post-Training, Alignment, LLM Agents
Pengcheng Wen
Hong Kong University of Science and Technology
Kunhao Pan
International Digital Economy Academy
large language model, vision language model
Sirui Han
The Hong Kong University of Science and Technology
Large Language Model, Interdisciplinary Artificial Intelligence
Yike Guo
Hong Kong University of Science and Technology