🤖 AI Summary
This study investigates how well large language model (LLM) judgments align with human judgments in affective assessment, and what that alignment implies for interactive effectiveness. Using multimodal affective data (human-annotated affective words and associated images), the authors employed GPT-4o and similar LLMs to generate ratings of valence, arousal, and five basic emotion categories, systematically comparing them against human annotations. Results show that GPT-4o correlates highly with human ratings across most dimensions (r ≥ 0.9), excelling in particular at happiness recognition. Its five-category emotion classification aligns with human judgments better than the conventional two-dimensional (valence-arousal) model does, and LLM ratings are substantially more homogeneous across models than human ratings are across individuals. Crucially, this work demonstrates that LLMs not only approximate population-level human affective responses but also vary less than individual humans do, providing empirical support for the trustworthy deployment of LLMs in affective intelligent interaction systems.
📝 Abstract
Emotions exert an immense influence over human behavior and cognition in both commonplace and high-stress tasks. Discussions of whether or how to integrate large language models (LLMs) into everyday life (e.g., acting as proxies for, or interacting with, human agents) should be informed by an understanding of how these tools evaluate emotionally loaded stimuli or situations. A model's alignment with human behavior in these cases can inform the effectiveness of LLMs for certain roles or interactions. To help build this understanding, we elicited ratings from multiple popular LLMs for datasets of words and images that had previously been rated for their emotional content by humans. We found that when performing the same rating tasks, GPT-4o responded very similarly to human participants across modalities, stimuli, and most rating scales (r = 0.9 or higher in many cases). However, arousal ratings were less well aligned between human and LLM raters, while happiness ratings were the most highly aligned. Overall, LLMs aligned better within a five-category (happiness, anger, sadness, fear, disgust) emotion framework than within a two-dimensional (arousal and valence) organization. Finally, LLM ratings were substantially more homogeneous than human ratings. Together, these results begin to describe how LLM agents interpret emotional stimuli and highlight similarities and differences between biological and artificial intelligence in key behavioral domains.
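The alignment analysis described above can be illustrated with a minimal sketch: correlating per-stimulus human ratings with LLM ratings, and comparing rating spread as a rough homogeneity measure. The numbers below are purely hypothetical placeholders, not data from the study, and the exact metrics the authors used may differ.

```python
import numpy as np

# Hypothetical mean valence ratings (1-9 scale) for ten stimuli.
# These values are illustrative only, not drawn from the paper's datasets.
human_valence = np.array([7.2, 2.1, 5.5, 8.0, 3.3, 6.7, 1.9, 4.4, 7.8, 2.6])
llm_valence   = np.array([7.0, 2.4, 5.9, 7.7, 3.0, 6.9, 2.2, 4.1, 8.1, 2.9])

# Pearson correlation between human and LLM ratings,
# analogous to the r values reported in the abstract.
r = np.corrcoef(human_valence, llm_valence)[0, 1]
print(f"Pearson r = {r:.3f}")

# A crude homogeneity comparison: spread of repeated ratings for one
# stimulus across hypothetical individual human raters vs. LLM runs.
human_raters = np.array([6.0, 8.5, 7.0, 5.5, 8.0])  # hypothetical individuals
llm_runs     = np.array([7.1, 7.0, 7.2, 7.1, 7.0])  # hypothetical repeated runs

print(f"human SD = {human_raters.std(ddof=1):.2f}, "
      f"LLM SD = {llm_runs.std(ddof=1):.2f}")
```

With these placeholder values the correlation is high and the LLM spread is far smaller than the human spread, mirroring the qualitative pattern the abstract reports.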