🤖 AI Summary
Current evaluation methods for large language model agents rely excessively on task success rates and fail to assess whether agents genuinely understand their environment. To address this limitation, this work proposes the Task-to-Quiz (T2Q) evaluation paradigm and introduces T2QBench—a benchmark comprising 30 environments and 1,967 question-answer pairs—which, for the first time, decouples task execution from environmental state comprehension. The framework systematically evaluates agents' world models through deterministic question generation, multi-difficulty environment modeling, fine-grained state representations, and an automated evaluation pipeline. Experimental results reveal that even when agents successfully complete tasks, they generally lack transferable environmental understanding, and existing memory mechanisms prove insufficient for effective environment modeling.
📝 Abstract
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.