🤖 AI Summary
While vision-language models (VLMs) are widely deployed in socially situated tasks, their cross-cultural theory of mind (ToM) reasoning capabilities remain systematically unassessed. Method: We introduce CulturalToM-VQA—the first cross-cultural ToM visual question-answering benchmark—comprising 5,095 culturally grounded questions spanning rituals, attire, gestures, and other culture-specific cues. We propose a taxonomy of six ToM task types and a four-level complexity hierarchy, constructed via a human-expert–guided, VLM-assisted pipeline to ensure cultural sensitivity and annotation reliability. Contribution/Results: Experiments reveal significant performance degradation of mainstream VLMs on non-Western cultural ToM tasks. This work is the first to systematically expose cultural ToM biases in VLMs, providing a reproducible benchmark, analytical tools, and an evaluation framework to advance culturally robust social AI.
📝 Abstract
Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5,095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse ToM facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
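The taxonomy described above (six ToM task types, four complexity levels, culturally grounded QA items) can be sketched as a data schema. This is a minimal illustrative sketch: the field names, validation logic, and the example item are hypothetical and do not reflect the released benchmark format.

```python
from dataclasses import dataclass

# The six ToM task types and four complexity levels named in the abstract.
TOM_TASKS = [
    "mental_state_attribution",
    "false_belief_reasoning",
    "non_literal_communication",
    "social_norm_violation",
    "perspective_coordination",
    "multi_agent_reasoning",
]
COMPLEXITY_LEVELS = [1, 2, 3, 4]


@dataclass
class ToMVQAItem:
    """One benchmark item (field names are illustrative, not the released schema)."""
    image_path: str   # culturally rich image curated by human experts
    culture: str      # cultural context the scene is grounded in
    task_type: str    # one of the six ToM task types
    complexity: int   # graded complexity level, 1 (simplest) to 4
    question: str
    answer: str

    def __post_init__(self):
        # Validate against the taxonomy so malformed items fail early.
        if self.task_type not in TOM_TASKS:
            raise ValueError(f"unknown task type: {self.task_type}")
        if self.complexity not in COMPLEXITY_LEVELS:
            raise ValueError(f"complexity must be in {COMPLEXITY_LEVELS}")


# A hypothetical example item (not drawn from the actual dataset).
item = ToMVQAItem(
    image_path="images/tea_ceremony_0042.jpg",
    culture="Japanese",
    task_type="false_belief_reasoning",
    complexity=2,
    question="The guest bows before entering; what does the host likely "
             "believe about the guest's familiarity with the ritual?",
    answer="The host likely believes the guest understands the ceremony's etiquette.",
)
print(item.task_type, item.complexity)
```

Encoding the taxonomy as explicit constants makes per-task and per-complexity accuracy breakdowns straightforward to compute when scoring model outputs.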