🤖 AI Summary
While vision-language models (VLMs) are widely deployed in socially situated tasks, their cross-cultural theory of mind (ToM) reasoning capabilities remain systematically unassessed. Method: We introduce CulturalToM-VQA—the first cross-cultural ToM visual question-answering benchmark—comprising 5,095 culturally grounded questions spanning rituals, attire, gestures, and other culture-specific cues. We propose a taxonomy of six ToM task types and a four-level complexity hierarchy, constructed via a human-expert–guided, VLM-assisted pipeline to ensure cultural sensitivity and annotation reliability. Contribution/Results: Experiments reveal significant performance degradation of mainstream VLMs on non-Western cultural ToM tasks. This work is the first to systematically expose cultural ToM biases in VLMs, providing a reproducible benchmark, analytical tools, and an evaluation framework to advance culturally robust social AI.
📝 Abstract
Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5,095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse ToM facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
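The taxonomy described above (six ToM task types, four complexity levels, culturally grounded QA items) can be sketched as a data schema. This is a minimal illustrative sketch: the field names, validation logic, and the example item are hypothetical and do not reflect the released benchmark format.

```python
from dataclasses import dataclass

# The six ToM task types and four complexity levels named in the abstract.
TOM_TASKS = [
    "mental_state_attribution",
    "false_belief_reasoning",
    "non_literal_communication",
    "social_norm_violation",
    "perspective_coordination",
    "multi_agent_reasoning",
]
COMPLEXITY_LEVELS = [1, 2, 3, 4]


@dataclass
class ToMVQAItem:
    """One benchmark item (field names are illustrative, not the released schema)."""
    image_path: str   # culturally rich image curated by human experts
    culture: str      # cultural context the scene is grounded in
    task_type: str    # one of the six ToM task types
    complexity: int   # graded complexity level, 1 (simplest) to 4
    question: str
    answer: str

    def __post_init__(self):
        # Validate against the taxonomy so malformed items fail early.
        if self.task_type not in TOM_TASKS:
            raise ValueError(f"unknown task type: {self.task_type}")
        if self.complexity not in COMPLEXITY_LEVELS:
            raise ValueError(f"complexity must be in {COMPLEXITY_LEVELS}")


# A hypothetical example item (not drawn from the actual dataset).
item = ToMVQAItem(
    image_path="images/tea_ceremony_0042.jpg",
    culture="Japanese",
    task_type="false_belief_reasoning",
    complexity=2,
    question="The guest bows before entering; what does the host likely "
             "believe about the guest's familiarity with the ritual?",
    answer="The host likely believes the guest understands the ceremony's etiquette.",
)
print(item.task_type, item.complexity)
```

Encoding the taxonomy as explicit constants makes per-task and per-complexity accuracy breakdowns straightforward to compute when scoring model outputs.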