AI Summary
Current agents exhibit underutilization, misinterpretation of predictions, or performance degradation when employing generative world models for prospective reasoning. This work presents the first systematic evaluation of vision-language model-based agents' ability to strategically invoke and integrate world model predictions across multitask scenarios, combining visual question answering with agent benchmark tasks and introducing attribution analysis to quantitatively measure simulation usage. The study reveals that agents proactively invoke simulations in fewer than 1% of cases, approximately 15% of predictions are misused, and enforced simulation use can degrade performance by up to 5%. These findings expose a cognitive bottleneck in how agents interpret and incorporate predictive information, underscoring the urgent need for calibrated mechanisms to enable reliable prospective reasoning.
Abstract
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.