What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

📅 2026-05-05

🤖 AI Summary
Vision-language model (VLM) agents that reason only over visited states lack the drive to actively explore “known unknowns” in sparse-reward environments, which limits robust generalization. To address this, the work proposes GLANCE, a framework that uses visual-linguistic inconsistency as an intrinsic curiosity signal. GLANCE grounds the predictions of the agent's linguistic world model in the stable visual representations of an evolving target network, and uses the resulting discrepancy within reinforcement learning to drive exploration toward areas where the internal model is uncertain. This unifies reasoning and exploration, steering the agent toward signals that challenge and refine its internal model. Experiments across a series of agentic tasks show that GLANCE is effective and that aligning “what is thought” with “what is seen” is key to solving complex or sparse-reward tasks.
📝 Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the “known unknown” required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning “what the agent thinks” with “what the agent sees” is key to solving complex or sparse agentic tasks.
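The curiosity signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the discrepancy measure, the target-network update rule, or how the bonus is combined with the task reward. The cosine distance, the exponential-moving-average (EMA) update, the weight `beta`, and all function names below are assumptions chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def curiosity_bonus(predicted: Vector, observed: Vector) -> float:
    """Cosine distance between the embedding predicted by the linguistic
    world model and the embedding of the actual next observation produced
    by the target network. Assumed measure; larger means a bigger mismatch."""
    p, o = l2_normalize(predicted), l2_normalize(observed)
    return 1.0 - sum(a * b for a, b in zip(p, o))


def ema_update(target: Vector, online: Vector, tau: float = 0.99) -> Vector:
    """One assumed way to keep a target network 'stable but evolving':
    move its parameters slowly toward the online parameters."""
    return [tau * t + (1.0 - tau) * w for t, w in zip(target, online)]


def shaped_reward(extrinsic: float, bonus: float, beta: float = 0.1) -> float:
    """Add the weighted curiosity bonus to the sparse task reward.
    beta is a hypothetical trade-off coefficient."""
    return extrinsic + beta * bonus


if __name__ == "__main__":
    predicted = [1.0, 0.0]   # what the agent "thinks" it will see
    observed = [0.0, 1.0]    # what the agent actually "sees"
    bonus = curiosity_bonus(predicted, observed)
    print(shaped_reward(0.0, bonus))
```

Under these assumptions, a state whose observed embedding matches the prediction yields no bonus, while a mismatched state yields a positive bonus even when the task reward is zero, which is the behavior a sparse-reward setting needs.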
Problem

Research questions and friction points this paper is trying to address.

vision-language models
curiosity-driven exploration
world modeling
sparse-reward tasks
epistemic uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-linguistic curiosity
world model alignment
intrinsic motivation
VLM agents
reinforcement learning
🔎 Similar Papers
Haoxi Li
Department of Computer Science and Engineering, Hong Kong University of Science and Technology (HKUST), Hong Kong, China
Qinglin Hou
Department of Computer Science, University of Southern California, Los Angeles, USA
Jianfei Ma
School of Computing, National University of Singapore, Singapore
Jinxiang Lai
Hong Kong University of Science and Technology (HKUST)
Multimodal LLM, Few-Shot Learning, Computer Vision
Tao Han
Huazhong University of Science and Technology
Wireless Communications, Computer Networks, Multimedia Communications
Sikai Bai
Department of Computer Science and Engineering, Hong Kong University of Science and Technology (HKUST), Hong Kong, China
Jingcai Guo
Hong Kong Polytechnic University
Efficient AI, Zero-Shot Learning, Edge AI, Machine Learning
Jie Zhang
The Chinese University of Hong Kong
Hardware Security & Reliability
Song Guo
Chair Professor of CSE, HKUST
Large Language Model, Edge AI, Machine Learning Systems