🤖 AI Summary
This study investigates whether large language models (LLMs) possess genuine emotion inference capabilities grounded in cognitive appraisal theory, rather than merely superficial emotion recognition. Method: We introduce CoRE, the first systematic, interpretable, standardized benchmark for cognitive appraisal-based emotion reasoning. CoRE covers multiple cognitive appraisal dimensions (e.g., responsibility attribution, goal congruence, perceived control) and combines model output analysis, dimension-wise correlation testing, and representation-space interpretability techniques to evaluate the internal structure of LLMs' emotion reasoning. Contribution/Results: Experiments reveal that diverse LLMs rely significantly on specific cognitive dimensions; that emotion categories are separable and interpretable within the cognitive appraisal space; and that models display distinct, quantifiable reasoning preferences. CoRE provides a reproducible paradigm and toolkit for assessing and advancing LLMs' deeper emotional understanding.
📝 Abstract
Affective Computing has been established as a crucial field of inquiry for advancing the holistic development of Artificial Intelligence (AI) systems. Foundation models, especially Large Language Models (LLMs), have been evaluated, trained, or instruction-tuned in several past works to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when presented with emotionally charged stimuli. We introduce CoRE (Cognitive Reasoning for Emotions), a large-scale benchmark for evaluating the internal cognitive structures that LLMs implicitly use for emotional reasoning. Through a range of evaluation experiments and analyses, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions? (b) Which cognitive dimensions are important for characterizing specific emotions? (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs. Our benchmark and code will be made publicly available.