🤖 AI Summary
Current vision-language models perform poorly on multi-step visual interactive tasks, struggling to integrate perception, memory, and action, particularly over extended decision-making horizons. To address this, the authors propose VisGym, the first systematic evaluation and training platform for multimodal agents, comprising 17 diverse environments that span symbolic reasoning, real-image understanding, navigation, and manipulation. The platform offers flexible configuration of difficulty, input modality, planning horizon, and feedback. Evaluations show that frontier models achieve low success rates on both easy (46.6%) and hard (26.0%) configurations and expose a critical bottleneck: current models fail to effectively leverage long-horizon contextual history. Supervised finetuning on demonstrations generated by a structured multi-step solver, augmented with goal observations, textual feedback, and exploratory demonstrations, yields consistent performance gains.
📝 Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, supervised finetuning with explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings yields consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
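To make the interaction setting concrete, the sketch below shows a minimal Gymnasium-style agent loop in which the policy conditions on a truncated window of past observations, the configuration the abstract reports outperforming an unbounded history. This is a toy illustration, not the actual VisGym API: `ToyGridEnv`, `run_episode`, and the `history_window` parameter are hypothetical names invented for this example.

```python
from collections import deque

class ToyGridEnv:
    """Hypothetical toy environment: move right along a 1-D grid to the goal."""
    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        # Clamp the agent's position to the grid; reward 1.0 on reaching the goal.
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1
        return self.pos, (1.0 if done else 0.0), done

def run_episode(env, policy, history_window=3, max_steps=20):
    """Roll out a policy that sees only a bounded window of past observations."""
    history = deque(maxlen=history_window)  # truncated context, not full history
    history.append(env.reset())
    for _ in range(max_steps):
        action = policy(list(history))
        obs, reward, done = env.step(action)
        history.append(obs)
        if done:
            return True, list(history)
    return False, list(history)

# A trivial "always move right" policy suffices in this toy environment.
success, final_history = run_episode(ToyGridEnv(), policy=lambda h: +1)
```

A VLM agent would replace the lambda with a model call that maps the (image) observation window to an action, and the multi-step solvers described above would play the role of the policy when generating finetuning demonstrations.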