AI Summary
Existing embodied agents lack theory of mind (ToM)-driven decision-making, and mainstream benchmarks model only human mental states while neglecting the agent's own perspective, leading to behavioral incoherence. To address this, we propose MindPower, a Robot-Centric framework and the first to enable vision-language models (VLMs) to jointly reason about both *self* and *other* mental states. We further introduce Mind-Reward, a novel multimodal reward mechanism that unifies perceptual grounding, mental state modeling, and reinforcement learning. Our approach significantly improves decision-action consistency, outperforming GPT-4o by 12.77% in decision making and 12.49% in action generation. By grounding agent behavior in interpretable, generalizable cognitive representations, MindPower establishes a principled foundation for socially situated interaction in embodied AI.
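To make the reward design concrete, here is a minimal sketch of how a Mind-Reward-style objective could be computed, assuming it is a weighted sum of a perceptual-grounding term, a mental-state-modeling term, and a decision-action consistency term. The term names, weights, and function signature are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: the exact form of Mind-Reward is not specified here, so
# we assume a weighted sum of three scalar terms, each normalized to [0, 1]:
#   r_percept - perceptual grounding (are referenced objects/people in the scene?)
#   r_mind    - quality of the self and other mental-state modeling
#   r_consist - consistency of the decision/action with the inferred states
from dataclasses import dataclass


@dataclass
class MindRewardWeights:
    percept: float = 0.3   # illustrative weights, not from the paper
    mind: float = 0.4
    consist: float = 0.3


def mind_reward(r_percept: float, r_mind: float, r_consist: float,
                w: MindRewardWeights = MindRewardWeights()) -> float:
    """Collapse the three terms into one scalar usable as an RL reward."""
    return w.percept * r_percept + w.mind * r_mind + w.consist * r_consist


# Example: well-grounded perception, decent ToM, weak consistency.
print(mind_reward(0.9, 0.7, 0.4))  # ~0.67
```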
Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making, and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by the inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
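The abstract names a four-stage loop but not its interfaces. The sketch below shows one plausible way the stages could be wired together; the class, method names, prompts, and the `vlm.generate(prompt, image)` interface are all assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of the four-stage Robot-Centric loop described above
# (Perception -> Mental Reasoning -> Decision Making -> Action).
class MindPowerAgent:
    def __init__(self, vlm):
        self.vlm = vlm  # any VLM exposing generate(prompt, image) -> str

    def step(self, image, human_utterance: str) -> tuple[str, str]:
        # Stage 1: Perception of the environment and the human's state.
        percept = self.vlm.generate(
            f"Describe the scene and the human, who said: '{human_utterance}'.",
            image)
        # Stage 2: ToM Reasoning over BOTH the robot's own and the human's
        # mental states (beliefs, desires, intentions).
        self_state = self.vlm.generate(
            f"As the robot, state your own beliefs, desires, and intentions, "
            f"given: {percept}", image)
        other_state = self.vlm.generate(
            f"Infer the human's beliefs, desires, and intentions, given: "
            f"{percept}", image)
        # Stage 3: Decision Making conditioned on both inferred states.
        decision = self.vlm.generate(
            f"Self: {self_state}\nOther: {other_state}\nChoose what to do.",
            image)
        # Stage 4: Action generation grounded in the decision.
        action = self.vlm.generate(
            f"Translate the decision '{decision}' into an executable action.",
            image)
        return decision, action
```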