🤖 AI Summary
Existing vision-language model (VLM) agents lack the capacity to actively explore “known unknowns” in sparse-reward environments, hindering robust generalization. To address this limitation, this work proposes GLANCE, a novel framework that leverages visual-linguistic inconsistency as an intrinsic curiosity signal for the first time. GLANCE aligns predictions from a language world model with stable visual representations provided by a target network and uses the resulting discrepancy to drive active exploration. This approach unifies reasoning and exploration mechanisms, enabling agents to proactively seek cognitive challenges that refine their internal models. Experimental results demonstrate that GLANCE significantly improves performance across multiple embodied AI tasks, underscoring the critical role of aligning “what is thought” with “what is seen” in overcoming sparse-reward challenges.
📝 Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the ``known unknown'' required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning ``what the agent thinks'' with ``what the agent sees'' is key to solving complex or sparse agentic tasks.