🤖 AI Summary
This work addresses the challenge of evaluating and modeling exploration-driven decision-making by AI agents in dynamic virtual escape rooms. We introduce VisEscape, a benchmark of 20 dynamically evolving escape rooms that provides a systematic setting for evaluating exploration-driven decision-making. Methodologically, we propose VisEscaper, a unified framework that integrates multimodal perception, external memory storage, replanning guided by environment feedback, and a ReAct-style reasoning–action loop, so that the agent can actively build spatiotemporal knowledge and correct its own actions. Experiments show that, compared with state-of-the-art multimodal models, VisEscaper performs 3.7× more effectively and 5.0× more efficiently on average. This work provides both a new benchmark and an architectural foundation for exploration-aware planning in embodied intelligence.
📝 Abstract
Escape rooms present a unique cognitive challenge that demands exploration-driven planning: players must actively search their environment, continuously update their knowledge based on new discoveries, and connect disparate clues to determine which elements are relevant to their objectives. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multimodal models generally fail to escape the rooms and vary considerably in how far they progress and in the trajectories they follow. To address this issue, we propose VisEscaper, which integrates Memory, Feedback, and ReAct modules and yields significant improvements, performing 3.7 times more effectively and 5.0 times more efficiently on average.
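
To make the interplay of memory, feedback, and a reason-act loop concrete, the following is a minimal sketch under strong assumptions and is not the authors' implementation. The room (`ToyRoom`), the fixed action list, and the helpers `choose_action` and `run_episode` are all hypothetical, and a hand-written rule stands in for the multimodal model. Memory here is the list of past (action, observation) pairs; feedback is the set of actions that failed since the last successful change to the room.

```python
from dataclasses import dataclass


@dataclass
class ToyRoom:
    """Hypothetical one-puzzle room: a key hidden in a drawer unlocks the door."""
    drawer_open: bool = False
    has_key: bool = False
    escaped: bool = False

    def step(self, action: str) -> str:
        """Apply an action and return a text observation (prefixed FAIL on failure)."""
        if action == "open drawer":
            self.drawer_open = True
            return "The drawer is open. There is a key inside."
        if action == "take key":
            if self.drawer_open:
                self.has_key = True
                return "You took the key."
            return "FAIL: you do not see a key."
        if action == "open door":
            if self.has_key:
                self.escaped = True
                return "The door opens. You escaped."
            return "FAIL: the door is locked."
        return "FAIL: unknown action."


# Assumed action set, ordered from goal-directed to exploratory.
ACTIONS = ["open door", "take key", "open drawer"]


def choose_action(memory, failed):
    """Rule-based stand-in for the reasoning step of a ReAct-style loop."""
    # Memory: do not repeat actions that already succeeded.
    succeeded = {a for a, obs in memory if not obs.startswith("FAIL")}
    for action in ACTIONS:
        # Feedback: skip actions that failed since the room last changed.
        if action not in succeeded and action not in failed:
            return action
    return None


def run_episode(room, max_steps=10):
    """Alternate reasoning and acting until escape, a dead end, or the step limit."""
    memory = []     # (action, observation) history
    failed = set()  # failures observed in the current room state
    for _ in range(max_steps):
        action = choose_action(memory, failed)
        if action is None:
            break
        observation = room.step(action)
        memory.append((action, observation))
        if observation.startswith("FAIL"):
            failed.add(action)
        else:
            failed.clear()  # the room changed, so earlier failures may no longer hold
        if room.escaped:
            break
    return memory
```

In this toy setting the agent first tries the door, learns from the failure, explores the drawer, and only then retrieves the key and escapes. Clearing the failure set after every successful action is one simple way to model replanning in an environment that changes; a real agent would replace the rule in `choose_action` with a model call conditioned on the stored observations.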