🤖 AI Summary
Current language model agents excel at goal-directed tasks, but their adaptive capabilities in unfamiliar environments (implicit goal discovery, creative tool use, and iterative problem-solving) have not been systematically evaluated. Method: We propose EscapeBench, the first non-goal-directed benchmark for creative adaptation, grounded in room-escape games to assess implicit reasoning and dynamic planning. We introduce EscapeAgent, a framework integrating Foresight (prospective tool rehearsal) and Reflection (reflective task diagnosis), enhanced by working-memory-augmented chain-of-thought reasoning, dynamically generated action chains, logical consistency maintenance, and multi-level adaptive prompting. Contribution/Results: Experiments show that EscapeAgent completes games with up to 40% fewer steps and hints, achieves higher action success rates, sustains logically coherent action chains over 1,000 steps, and performs robustly across difficulty levels.
📝 Abstract
Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room-escape game environments that require creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains of over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across varying difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies. All data and code are released.