MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Modeling the mental states that drive multi-agent interactions in realistic embodied environments remains a significant challenge for AI. Method: This paper introduces MuMA-ToM, the first multimodal Theory of Mind (ToM) benchmark for embodied multi-agent interaction. It pairs videos with text descriptions of people's behavior in household settings and asks questions requiring inference of agents' goals, beliefs, and beliefs about others' goals. The authors also propose LIMP (Language model-based Inverse Multi-agent Planning), which combines LLM-based reasoning with inverse multi-agent planning over fused video and text information to infer these nested mental states. Contribution/Results: Experiments show that LIMP significantly outperforms state-of-the-art models on MuMA-ToM, including GPT-4o, Gemini-1.5 Pro, and the recent multimodal ToM model BIP-ALM; the benchmark is additionally validated with a human baseline. This work establishes a foundational benchmark and methodology for evaluating ToM in embodied multi-agent systems.
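The summary names the key mechanism: inverse multi-agent planning driven by a language model. As a rough illustration only (the paper's actual architecture, prompts, and hypothesis space are not specified here, and every name below is hypothetical), the core loop can be sketched as Bayesian-style scoring of candidate mental-state hypotheses against observed actions:

```python
# A minimal sketch of language-model-driven inverse multi-agent planning.
# All names (Hypothesis, score_action_likelihood, infer_mental_state) are
# illustrative, not the authors' actual API, and `llm` stands in for any
# text-completion callable that returns a number as a string.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    goal: str                # e.g., "bring the other agent a wine glass"
    belief: str              # the agent's belief about the environment
    belief_about_other: str  # nested belief about the other agent's goal
    prior: float             # prior probability of this mental state

def score_action_likelihood(
    llm: Callable[[str], str], h: Hypothesis, actions: list[str]
) -> float:
    """Ask the language model how likely the observed actions are,
    assuming the agent holds this particular mental state."""
    prompt = (
        f"Goal: {h.goal}\n"
        f"Belief: {h.belief}\n"
        f"Belief about the other agent: {h.belief_about_other}\n"
        f"Observed actions: {'; '.join(actions)}\n"
        "On a scale from 0 to 1, how likely are these actions under this "
        "mental state? Answer with a single number."
    )
    return float(llm(prompt))

def infer_mental_state(
    llm: Callable[[str], str], hypotheses: list[Hypothesis], actions: list[str]
) -> Hypothesis:
    """Inverse planning as Bayesian inference: posterior ∝ likelihood × prior.
    Normalization is omitted because it does not change the argmax."""
    scores = [score_action_likelihood(llm, h, actions) * h.prior for h in hypotheses]
    return hypotheses[max(range(len(hypotheses)), key=scores.__getitem__)]
```

Under this reading, answering a multiple-choice MuMA-ToM question would amount to mapping each answer option to one hypothesis and picking the one the inverse planner ranks highest.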

📝 Abstract
Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal: we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
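The abstract describes the task format in prose. Purely as an illustration of that format (not the dataset's actual schema, which is not specified here; field names and example content are invented), a single item might be represented like this:

```python
# Hypothetical shape of a single MuMA-ToM item. The released benchmark
# may structure its data differently.
from dataclasses import dataclass

@dataclass
class MuMAToMItem:
    video_path: str      # clip of one agent acting in a household scene
    text_context: str    # text describing (part of) the agents' behavior
    question: str        # probes a goal, a belief, or a belief about a goal
    options: list[str]   # multiple-choice answers
    question_type: str   # "goal" | "belief" | "belief_about_goal"

item = MuMAToMItem(
    video_path="episodes/household_012.mp4",
    text_context="Earlier, John told Mary he was looking for the remote.",
    question="Why does Mary open the living-room drawer?",
    options=[
        "She believes the remote is inside and wants to help John find it.",
        "She is tidying the drawer and is unaware of John's search.",
    ],
    question_type="belief_about_goal",
)
```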
Problem

Research questions and friction points this paper is trying to address.

AI Emotion Recognition
Complex Real-life Scenarios
Multi-person Interaction Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

MuMA-ToM
LIMP
Multi-person Interaction
Haojun Shi
Johns Hopkins University
Suyu Ye
Johns Hopkins University
Xinyu Fang
Johns Hopkins University
Chuanyang Jin
Johns Hopkins University
Leyla Isik
Johns Hopkins University
Yen-Ling Kuo
University of Virginia
Artificial Intelligence · Robotics · Human-AI/Robot Interaction
Tianmin Shu
Assistant Professor, JHU
Artificial Intelligence · Cognitive Science