Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

📅 2026-02-01
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal large language models in deep emotional understanding, particularly their inability to model the cognitive mechanisms underlying emotion generation. To bridge this gap, the study introduces Theory of Mind (ToM) into multimodal affective reasoning for the first time, proposing HitEmotion—a hierarchical evaluation benchmark designed to diagnose models’ reasoning deficiencies under high cognitive load. Furthermore, the authors develop a ToM-guided chain-of-thought reasoning framework and a process-supervised reinforcement learning method, termed TMPO, which leverages intermediate mental states for training. Experimental results demonstrate that the proposed approach significantly improves both the accuracy and plausibility of emotional reasoning, effectively advancing multimodal models toward cognitively grounded affective understanding.

📝 Abstract
Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we first introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. The ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. Overall, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.
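
The abstract's idea of using intermediate mental states as process-level supervision can be illustrated with a toy reward function. This is a minimal sketch under stated assumptions, not the TMPO objective: the function name `process_supervised_reward`, the exact-match scoring of mental-state steps, and the `process_weight` parameter are all hypothetical and do not come from the paper. It only shows how a reward might combine final-answer correctness with agreement on intermediate steps.

```python
from typing import Sequence


def _norm(text: str) -> str:
    """Normalize a label for a case- and whitespace-insensitive comparison."""
    return text.strip().lower()


def process_supervised_reward(
    predicted_states: Sequence[str],
    reference_states: Sequence[str],
    predicted_emotion: str,
    reference_emotion: str,
    process_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Toy reward: outcome correctness plus weighted mental-state agreement."""
    # Outcome term: 1.0 if the final emotion label matches, else 0.0.
    outcome = 1.0 if _norm(predicted_emotion) == _norm(reference_emotion) else 0.0

    # Process term: fraction of reference mental-state steps matched in order.
    if reference_states:
        matches = sum(
            _norm(p) == _norm(r)
            for p, r in zip(predicted_states, reference_states)
        )
        process = matches / len(reference_states)
    else:
        process = 0.0

    return outcome + process_weight * process


if __name__ == "__main__":
    reward = process_supervised_reward(
        predicted_states=["belief: gift is late", "desire: apology"],
        reference_states=["belief: gift is late", "desire: reassurance"],
        predicted_emotion="Disappointment",
        reference_emotion="disappointment",
    )
    print(reward)
```

In this sketch a rollout with the right emotion but a partly wrong mental-state chain scores lower than one that is right at every step, which is the general intuition behind process-level supervision. A real system would likely replace exact string matching with a learned or rubric-based judge.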
Problem

Research questions and friction points this paper is trying to address.

Theory of Mind
Multimodal Emotion Reasoning
Cognitive Depth
Affective Intelligence
Emotional Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theory of Mind
Multimodal Emotion Reasoning
Hierarchical Benchmark
Reinforcement Learning
Cognitive Modeling
Meng Luo
National University of Singapore
Human-Centered AI, Multimodal Understanding, Multimodal Reasoning
Bobo Li
National University of Singapore
Natural Language Processing
Shanqing Xu
Huazhong University of Science and Technology
Shize Zhang
National University of Singapore
Qiuchan Chen
Huazhong University of Science and Technology
Menglu Han
Huazhong University of Science and Technology
Wenhao Chen
National University of Singapore
Yanxiang Huang
Hong Kong Polytechnic University
Hao Fei
National University of Singapore
Vision and Language, Large Language Model, Natural Language Processing, World Modeling
Mong-Li Lee
National University of Singapore
Wynne Hsu
National University of Singapore