🤖 AI Summary
Existing Theory of Mind (ToM) evaluation frameworks focus predominantly on English, overlooking how linguistic diversity shapes mental-state reasoning. Method: We introduce "Multilingual Theory of Mind" (Multilingual ToM) and present XToM, the first cross-lingual ToM benchmark, covering Chinese, English, French, Spanish, and Japanese across diverse cognitive scenarios. Task templates are grounded in cognitive-science principles and constructed via human annotation, expert validation, and multilingual prompt engineering; evaluation follows zero-shot and few-shot protocols. Contribution/Results: Experiments reveal that state-of-the-art large language models (e.g., DeepSeek R1), despite strong multilingual comprehension, show significant cross-lingual variance in ToM performance, indicating that they lack human-level multilingual generalization in mental-state inference. This work provides the first empirical evidence that ToM performance depends on language and establishes both a novel benchmark and a theoretical foundation for multilingual cognitive modeling.
📝 Abstract
Theory of Mind (ToM), the ability to infer others' mental states, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel at multilingual language understanding, their ToM performance varies markedly across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.