🤖 AI Summary
This work addresses a limitation of existing benchmarks, which predominantly assess literal theory of mind (ToM) and fail to evaluate agents’ capacity to reason and coordinate based on implicit beliefs in embodied collaboration, a capacity termed functional ToM. We propose and formalize the first evaluation framework for functional ToM, introducing EnactToM, a 3D household benchmark comprising 300 evolving tasks characterized by partial observability, private information, and constrained communication. Tasks are formally verified for solvability and required epistemic depth, and new, harder tasks are generated as model capabilities advance. On the hard split, all seven evaluated frontier models achieve a 0.0% functional completion rate (Pass^3) while averaging 45.0% on literal belief probes; manual analysis attributes 93% of sampled failures to breakdowns in epistemic coordination, underscoring the benchmark’s role in evaluating ToM in embodied collaborative settings.
📝 Abstract
Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.