AI Summary
Existing embodied agents lack theory of mind (ToM)-driven decision-making, and mainstream benchmarks model only human mental states while neglecting the agent's own perspective, leading to behavioral incoherence. To address this, we propose MindPower, a Robot-Centric framework and the first to enable vision-language models (VLMs) to jointly reason about both *self* and *other* mental states. We further introduce Mind-Reward, a novel multimodal reward mechanism that unifies perceptual grounding, mental state modeling, and reinforcement learning. Our approach significantly improves decision-action consistency, outperforming GPT-4o by 12.77% in decision making and 12.49% in action generation. By grounding agent behavior in interpretable, generalizable cognitive representations, MindPower establishes a principled foundation for socially situated interaction in embodied AI.
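To make the reward design concrete, here is a minimal sketch of how a Mind-Reward-style objective could be computed, assuming it is a weighted sum of a perceptual-grounding term, a mental-state-modeling term, and a decision-action consistency term. The term names, weights, and function signature are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: the exact form of Mind-Reward is not specified here, so
# we assume a weighted sum of three scalar terms, each normalized to [0, 1]:
#   r_percept - perceptual grounding (are referenced objects/people in the scene?)
#   r_mind    - quality of the self and other mental-state modeling
#   r_consist - consistency of the decision/action with the inferred states
from dataclasses import dataclass


@dataclass
class MindRewardWeights:
    percept: float = 0.3   # illustrative weights, not from the paper
    mind: float = 0.4
    consist: float = 0.3


def mind_reward(r_percept: float, r_mind: float, r_consist: float,
                w: MindRewardWeights = MindRewardWeights()) -> float:
    """Collapse the three terms into one scalar usable as an RL reward."""
    return w.percept * r_percept + w.mind * r_mind + w.consist * r_consist


# Example: well-grounded perception, decent ToM, weak consistency.
print(mind_reward(0.9, 0.7, 0.4))  # ~0.67
```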
Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making, and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by the inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
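The abstract names a four-stage loop but not its interfaces. The sketch below shows one plausible way the stages could be wired together; the class, method names, prompts, and the `vlm.generate(prompt, image)` interface are all assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of the four-stage Robot-Centric loop described above
# (Perception -> Mental Reasoning -> Decision Making -> Action).
class MindPowerAgent:
    def __init__(self, vlm):
        self.vlm = vlm  # any VLM exposing generate(prompt, image) -> str

    def step(self, image, human_utterance: str) -> tuple[str, str]:
        # Stage 1: Perception of the environment and the human's state.
        percept = self.vlm.generate(
            f"Describe the scene and the human, who said: '{human_utterance}'.",
            image)
        # Stage 2: ToM Reasoning over BOTH the robot's own and the human's
        # mental states (beliefs, desires, intentions).
        self_state = self.vlm.generate(
            f"As the robot, state your own beliefs, desires, and intentions, "
            f"given: {percept}", image)
        other_state = self.vlm.generate(
            f"Infer the human's beliefs, desires, and intentions, given: "
            f"{percept}", image)
        # Stage 3: Decision Making conditioned on both inferred states.
        decision = self.vlm.generate(
            f"Self: {self_state}\nOther: {other_state}\nChoose what to do.",
            image)
        # Stage 4: Action generation grounded in the decision.
        action = self.vlm.generate(
            f"Translate the decision '{decision}' into an executable action.",
            image)
        return decision, action
```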