🤖 AI Summary
This work addresses the insufficient evaluation of large language models (LLMs) on complex, multi-step planning tasks. To this end, we introduce Plancraft, the first multimodal planning benchmark built around the Minecraft crafting GUI, supporting both textual and visual inputs. Plancraft integrates the Minecraft Wiki as a knowledge base for retrieval-augmented generation (RAG) evaluation, includes an ablatable oracle planner and oracle RAG extractor, and annotates task solvability, including deliberately unsolvable instances, to systematically assess agents' planning, tool use, and solvability judgment. Its core contribution is unifying solvability determination, multimodal planning, RAG, and component-level ablation within a single framework. Experiments show that state-of-the-art LLMs and vision-language models (VLMs) substantially underperform a handcrafted rule-based planner in long-horizon reasoning, resource-dependency modeling, and failure anticipation, exposing fundamental limitations in their planning ability.
📝 Abstract
We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft provides both a text-only and a multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and an oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.
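The solvability-aware evaluation described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not Plancraft's actual API: the `Task` fields, recipe format, and scoring rule are all assumptions made for clarity. It shows the key idea that an agent is credited both for crafting when a task is solvable and for declaring "impossible" when it is not.

```python
# Hypothetical sketch of solvability-aware scoring; names and data
# structures are illustrative assumptions, not Plancraft's real interface.
from dataclasses import dataclass


@dataclass
class Task:
    target: str
    inventory: dict  # item name -> count available
    solvable: bool   # ground-truth annotation from the benchmark


def agent_decision(task: Task, recipes: dict) -> str:
    """Toy 'agent': declare impossible if any ingredient is missing, else craft."""
    needed = recipes.get(task.target, {})
    if all(task.inventory.get(item, 0) >= n for item, n in needed.items()):
        return "craft"
    return "impossible"


def score(task: Task, action: str) -> bool:
    """Credit crafting on solvable tasks and flagging unsolvable ones."""
    return action == ("craft" if task.solvable else "impossible")


recipes = {"stick": {"planks": 2}}
tasks = [
    Task("stick", {"planks": 4}, solvable=True),    # enough planks: craftable
    Task("stick", {"stone": 1}, solvable=False),    # no planks: should be flagged
]
results = [score(t, agent_decision(t, recipes)) for t in tasks]
accuracy = sum(results) / len(tasks)
```

An LLM agent replaces `agent_decision` with a model call; the scoring logic that rewards correct "impossible" judgments stays the same.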