EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing benchmarks, which predominantly assess literal theory of mind (ToM) and do not evaluate agents' capacity to reason and coordinate on implicit beliefs in embodied collaboration, termed functional ToM. The authors formalize an evaluation framework for functional ToM and introduce a 3D household benchmark of 300 evolving tasks characterized by partial observability, private information, and constrained communication. Tasks are verified for solvability and required epistemic depth, and new tasks are generated at higher difficulty as model capabilities advance. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion while averaging 45.0% on literal belief probes; manual analysis traces 93% of sampled failures to epistemic coordination breakdowns, which highlights the gap between answering belief questions and acting on beliefs in embodied collaborative settings.

📝 Abstract

Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
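
The abstract reports Pass^3 without defining it. A minimal sketch of one common reading follows, assuming Pass^k counts a task as solved only when all k independent trials succeed; this definition, the function name `pass_hat_k`, and the example data are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials all succeeded.

    ASSUMPTION: this "all k trials succeed" reading of Pass^k is not
    confirmed by the source; it is one common convention.
    """
    if not trials:
        raise ValueError("no tasks provided")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A task counts only if every one of its first k trials passed.
        if all(outcomes[:k]):
            solved += 1
    return solved / len(trials)


# Hypothetical per-task trial outcomes (made up for illustration).
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
score = pass_hat_k(example, k=3)
print(f"Pass^3 = {score:.1%}")
```

Under this reading, a single failed trial zeroes out a task, so a 0.0% score means no task was completed in all three attempts, which is stricter than an average success rate.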

Problem

Research questions and friction points this paper is trying to address.

Theory of Mind

functional ToM

embodied agents

multi-agent collaboration

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

functional Theory of Mind

embodied agents

multi-agent collaboration

evolving benchmark

epistemic coordination

🔎 Similar Papers

Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective

2024-10-08 | Citations: 0

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

2024-08-22 | arXiv.org | Citations: 0