MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

📅 2025-11-28
🤖 AI Summary
Existing embodied agents lack theory of mind (ToM)-driven decision-making, and mainstream benchmarks model only human mental states while neglecting the agent's own perspective, which hinders coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework in which a vision-language model (VLM) reasons about both *self* and *other* mental states across four stages: Perception, Mental Reasoning, Decision Making and Action. We further introduce Mind-Reward, an optimization objective that encourages the VLM to keep its ToM reasoning and its behavior consistent. The model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation. By guiding agent behavior with inferred mental states, MindPower aims to support socially situated interaction in embodied AI.

๐Ÿ“ Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
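
The staged flow described in the abstract (perceive, reason about self and others, decide, act) can be pictured with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `MentalState`, `Trace`, `run_pipeline`, and the toy stage functions are illustrative stand-ins, and a real system would replace each stage with a VLM call. The point is only to show how inferred mental states for both the robot and the human could be passed forward to guide the decision and the action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MentalState:
    """Beliefs, desires, and intentions attributed to one party (hypothetical schema)."""
    beliefs: List[str] = field(default_factory=list)
    desires: List[str] = field(default_factory=list)
    intentions: List[str] = field(default_factory=list)


@dataclass
class Trace:
    """Intermediate output of every stage, kept so each step can be inspected."""
    perception: str = ""
    self_state: MentalState = field(default_factory=MentalState)
    other_state: MentalState = field(default_factory=MentalState)
    decision: str = ""
    actions: List[str] = field(default_factory=list)


def run_pipeline(
    observation: str,
    perceive: Callable[[str], str],
    reason: Callable[[str], Tuple[MentalState, MentalState]],
    decide: Callable[[MentalState, MentalState], str],
    act: Callable[[str], List[str]],
) -> Trace:
    """Run the four stages in order, feeding each stage the previous stage's output."""
    trace = Trace()
    trace.perception = perceive(observation)
    # The reasoning stage models both the robot (self) and the human (other).
    trace.self_state, trace.other_state = reason(trace.perception)
    trace.decision = decide(trace.self_state, trace.other_state)
    trace.actions = act(trace.decision)
    return trace


# Toy stand-ins for the stages (assumptions for illustration only).
def toy_perceive(obs: str) -> str:
    return f"scene: {obs}"


def toy_reason(perception: str) -> Tuple[MentalState, MentalState]:
    robot = MentalState(beliefs=["the cup is on the table"])
    human = MentalState(beliefs=["the cup is in the cabinet"], desires=["get the cup"])
    return robot, human


def toy_decide(robot: MentalState, human: MentalState) -> str:
    # If the human's belief differs from the robot's, help correct it.
    if robot.beliefs != human.beliefs:
        return "tell the human where the cup is"
    return "do nothing"


def toy_act(decision: str) -> List[str]:
    return [f"say({decision!r})"]


trace = run_pipeline("human opens cabinet", toy_perceive, toy_reason, toy_decide, toy_act)
print(trace.decision)
print(trace.actions)
```

Keeping every intermediate result in one `Trace` object is a design choice of this sketch: it makes it possible to check whether the final action actually follows from the inferred mental states, which is the kind of consistency the abstract says Mind-Reward is meant to encourage.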

Problem

Research questions and friction points this paper is trying to address.

Current vision-language embodied agents do not base decisions on inferred mental states such as beliefs, desires, and intentions
Existing benchmarks focus solely on human mental states and ignore the agent's own perspective
Without ToM reasoning linking perception to action, decisions and actions are not generated coherently

Innovation

Methods, ideas, or system contributions that make the work stand out.

MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action
ToM Reasoning that models the mental states of both the agent itself and others
Mind-Reward, an optimization objective that encourages consistent ToM reasoning and behavior