AI Summary
Current evaluations of theory of mind (ToM) in large language models (LLMs) are largely confined to textual inputs and belief-related tasks, offering an incomplete assessment of their social cognition capabilities. To address this limitation, this work proposes CoMMET, the first multimodal ToM benchmark specifically designed for multi-turn dialogues, encompassing a diverse range of mental states and moral judgment tasks. Through the construction of multimodal stimuli, interactive dialogue scenarios, and systematic model evaluation, CoMMET reveals the boundaries of existing LLMs in complex social reasoning. The benchmark provides empirical evidence and actionable insights for advancing the social intelligence of artificial agents, highlighting both current shortcomings and promising directions for future development.
Abstract
Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.