Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current embodied agent benchmarks are limited by data contamination and scene-specific biases, hindering reliable evaluation of their capabilities in unseen 3D home environments. This work proposes TEA, a human cognition-inspired dynamic in-situ task generation framework that leverages structured task graphs and a two-stage interaction-evolution mechanism to automatically produce 87,876 human-validated, cognitively plausible tasks across ten previously unseen scenes, without relying on external data. TEA is the first approach to enable environment-driven, diverse task generation and reuse, facilitating authentic in-situ agent evaluation. Experimental results reveal that state-of-the-art models still exhibit significant deficiencies in fundamental perception, 3D interaction awareness, and task robustness.

๐Ÿ“ Abstract
As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, making them inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and to encompass essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in-situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness, and show high sensitivity to task types in reasoning. These sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments.
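The evolution stage described above, recombining existing task graphs to generate new tasks without external data, can be sketched abstractly. Everything below is an illustrative assumption: the paper does not publish its graph schema, so the `TaskGraph` layout and the chaining rule in `evolve` are hypothetical, chosen only to show how graph-structured tasks make recombination mechanical.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: TEA represents tasks as structured graphs,
# but this node/edge layout and recombination rule are assumptions,
# not the paper's actual implementation.

@dataclass
class TaskGraph:
    # Nodes are (action, object) steps; edges encode ordering constraints
    # between step indices.
    nodes: list[tuple[str, str]]
    edges: list[tuple[int, int]] = field(default_factory=list)

def evolve(a: TaskGraph, b: TaskGraph) -> TaskGraph:
    """Evolution-stage sketch: recombine two existing task graphs into a
    new, longer task by concatenating their steps and chaining them."""
    nodes = a.nodes + b.nodes
    offset = len(a.nodes)
    # Re-index b's ordering constraints after a's steps.
    edges = a.edges + [(u + offset, v + offset) for (u, v) in b.edges]
    # Chain the last step of task a to the first step of task b.
    if a.nodes and b.nodes:
        edges.append((len(a.nodes) - 1, offset))
    return TaskGraph(nodes, edges)

# Two tasks the interaction stage might have produced (hypothetical).
fetch = TaskGraph([("navigate", "kitchen"), ("pick_up", "mug")], [(0, 1)])
place = TaskGraph([("navigate", "table"), ("put_down", "mug")], [(0, 1)])

combined = evolve(fetch, place)
print(len(combined.nodes))  # 4
print(combined.edges)       # [(0, 1), (2, 3), (1, 2)]
```

The point of the sketch is the design choice the abstract implies: because tasks are graphs rather than free-form text, reuse reduces to index-shifted graph splicing, which is why new tasks can be generated indefinitely without any external dataset.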
Problem

Research questions and friction points this paper is trying to address.

embodied agents
in-situ evaluation
cognitive task generation
unseen environments
3D interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-situ task generation
embodied agents
task graph
interactive evolution
3D environment evaluation