🤖 AI Summary
Existing evaluation benchmarks inadequately assess the mixed-initiative interaction capabilities of memory-aware agents in open-world Minecraft. This work proposes the first parameterized task suite constructed from real human co-play data, enabling transparent evaluation of planning, action, and memory coordination through explicit preconditions, dependency structures, and machine-verifiable success criteria under a bounded-knowledge policy. The system uses GPT-4o as a baseline agent, augmented with bounded-knowledge strategies, event tracking, and a mixed-initiative clarification mechanism. Evaluation across 216 subtasks with eight experienced players reveals characteristic failure modes in code execution, item manipulation, reference resolution, and navigation, while showing that lightweight memory use and mixed-initiative clarification substantially improve task recovery. Participants rated interaction quality and interface usability positively.
📝 Abstract
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited through formative and summative co-play with expert players, then normalized into parametric templates with explicit preconditions and dependency structure. These tasks are paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan, action, and memory events, including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, and reports outcomes relative to the total number of attempted subtasks using only in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory and tool handling, referencing, and navigation, alongside successful recoveries supported by mixed-initiative clarifications and lightweight memory use. Participants rated interaction quality and interface usability positively, while noting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and evaluation harness to support transparent and reproducible evaluation of future memory-aware embodied agents.
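To make the "parametric templates with explicit preconditions and machine-checkable validators" idea concrete, here is a minimal sketch of what such a template might look like. All names (`WorldState`, `TaskTemplate`, `craft_task`) are hypothetical illustrations, not the paper's actual schema; validators here check only in-world evidence (inventory state), in the spirit of the bounded-knowledge policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical in-world snapshot: inventory counts plus observed events.
@dataclass
class WorldState:
    inventory: Dict[str, int] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)

# A parametric task with explicit preconditions, dependency structure,
# and a machine-checkable success validator (illustrative only).
@dataclass
class TaskTemplate:
    name: str
    params: Dict[str, object]
    depends_on: List[str]                       # prerequisite subtask names
    precondition: Callable[[WorldState], bool]  # must hold before the attempt
    validator: Callable[[WorldState], bool]     # success check, in-world evidence only

def craft_task(item: str, count: int, deps: List[str]) -> TaskTemplate:
    """Instantiate a crafting template for a given item and quantity."""
    return TaskTemplate(
        name=f"craft_{item}",
        params={"item": item, "count": count},
        depends_on=deps,
        precondition=lambda s: s.inventory.get("crafting_table", 0) >= 1,
        validator=lambda s: s.inventory.get(item, 0) >= count,
    )

# Instantiate the template and check it against a world snapshot.
task = craft_task("stone_pickaxe", 1, deps=["gather_cobblestone", "craft_sticks"])
state = WorldState(inventory={"crafting_table": 1, "stone_pickaxe": 1})
print(task.precondition(state), task.validator(state))  # → True True
```

Separating the precondition (can the attempt start?) from the validator (did it succeed?) is what lets a harness attribute failures to planning versus execution, and to count recoveries against the total attempted subtasks.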