EscapeBench: Pushing Language Models to Think Outside the Box

📅 2024-12-18

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Problem: Current language model agents excel at goal-directed tasks, but their adaptive capabilities in unfamiliar environments (implicit goal discovery, creative tool use, and iterative problem-solving) have not been systematically evaluated. Method: We propose EscapeBench, a benchmark for creative adaptation without explicit objectives, built on room-escape games to assess implicit reasoning and dynamic planning. We introduce EscapeAgent, a framework that combines Foresight (anticipating innovative tool use) with Reflection (identifying unsolved tasks) on top of working memory and chain-of-thought reasoning. Contribution/Results: Experiments show that EscapeAgent reduces the steps and hints needed to complete a game by up to 40%, improves action success rates, sustains logically coherent action chains beyond 1,000 steps, and performs robustly across difficulty levels.

📝 Abstract

Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across varying difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies. All data and code are released.
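To make the Foresight and Reflection idea more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The `ToyRoom` environment, the `run_agent` function, and the rule-based logic are all hypothetical stand-ins: Reflection is modeled as a list of obstacles the agent failed to clear, and Foresight as pairing each newly acquired item with those unsolved obstacles. A real agent would presumably use a language model to propose and rank such actions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRoom:
    """Hypothetical toy environment; not an EscapeBench game."""
    # Each obstacle maps to (tool that clears it, item it yields).
    obstacles: dict = field(default_factory=lambda: {
        "box": ("screwdriver", "key"),
        "door": ("key", "exit"),
    })
    loose_items: list = field(default_factory=lambda: ["screwdriver"])

    def apply(self, tool, target):
        """Return the item gained if `tool` clears `target`, else None."""
        needed, reward = self.obstacles.get(target, (None, None))
        if tool is not None and tool == needed:
            del self.obstacles[target]
            return reward
        return None


def run_agent(room, max_steps=20):
    """Return (escaped, steps_used, action_log) for the toy room."""
    inventory, unsolved, log = [], [], []
    # Start by trying every obstacle bare-handed (tool=None).
    pending = [(None, target) for target in list(room.obstacles)]
    to_collect = list(room.loose_items)
    steps = 0
    while steps < max_steps:
        if pending:
            tool, target = pending.pop(0)
            if target not in room.obstacles:
                continue  # already solved; skip without spending a step
            steps += 1
            reward = room.apply(tool, target)
            log.append((tool, target, reward))
            if reward is None:
                # Reflection (assumed form): remember the failed task.
                if target not in unsolved:
                    unsolved.append(target)
                continue
            if target in unsolved:
                unsolved.remove(target)
            if reward == "exit":
                return True, steps, log
            new_item = reward
        elif to_collect:
            steps += 1
            new_item = to_collect.pop(0)
            log.append(("pick up", new_item, None))
        else:
            break  # nothing left to try
        inventory.append(new_item)
        # Foresight (assumed form): plan to try the new item on
        # every task that is still unsolved.
        pending.extend((new_item, task) for task in unsolved)
    return False, steps, log


if __name__ == "__main__":
    escaped, steps, log = run_agent(ToyRoom())
    print(escaped, steps)
    for entry in log:
        print(entry)
```

In this toy setup the agent wastes one action trying the screwdriver on the door. That is the kind of unsuccessful attempt a stronger Foresight step would aim to filter out, which connects loosely to the action success rates the abstract reports.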

Problem

Research questions and friction points this paper is trying to address.

Challenges agents with creative reasoning in unfamiliar environments

Highlights limitations of current language models in creativity

Proposes framework to enhance creative reasoning and problem-solving

Innovation

Methods, ideas, or system contributions that make the work stand out.

EscapeBench benchmark for creative reasoning

EscapeAgent with Foresight and Reflection

Logically coherent action chains of over 1,000 steps

🔎 Similar Papers

Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models