Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether multimodal large language models (MLLMs)—such as Gemini and GPT—can effectively model the high-dimensional affective structures elicited in humans during video viewing. Method: We propose a novel paradigm based on Gromov–Wasserstein optimal transport to quantify structural alignment—across videos—between MLLM-generated affective representations and human self-reported affective ratings (via dimensional scales), moving beyond conventional pointwise similarity metrics. Contribution/Results: Our analysis reveals strong structural consistency between MLLM outputs and human affect at the emotion-category level (high inter-structure correlation) with robust generalizability; however, significant discrepancies persist at the single-video granularity. By enabling cross-video relational assessment, our framework provides a scalable, multi-level quantitative evaluation methodology for affective understanding in MLLMs. It thus delineates both the current boundaries and untapped potential of MLLMs in modeling human affect, offering a principled foundation for future advances in affect-aware multimodal AI.

📝 Abstract
Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Fully capturing this complexity requires new approaches, because conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) captures these high-dimensional, intricate emotion structures, including their capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., from Gemini or GPT). We evaluated performance not only at the individual-video level but also on emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further probe whether this similarity holds at the single-item level or only at the coarse categorical level, we applied Gromov–Wasserstein optimal transport. We found that although performance was not necessarily high at the strict single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the models can infer human emotional experiences at the category level. Our results suggest that current state-of-the-art MLLMs broadly capture complex high-dimensional emotion structures at the category level, while also revealing their apparent limitations in accurately capturing entire structures at the single-item level.
Problem

Research questions and friction points this paper is trying to address.

Comparing high-dimensional emotion structures between humans and MLLMs
Assessing MLLMs' ability to capture nuanced human emotions from videos
Evaluating category-level vs single-item emotion inference accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used Multimodal LLMs to analyze emotion structures
Compared human and model emotion ratings via videos
Applied Gromov–Wasserstein optimal transport analysis
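The Gromov–Wasserstein comparison named above can be sketched in a few lines. Below is a minimal NumPy implementation of entropic Gromov–Wasserstein optimal transport (projected-gradient style with a square loss), not the authors' actual analysis pipeline; the dissimilarity matrices, uniform weights, and the regularization strength `eps` are illustrative assumptions.

```python
import numpy as np

def sinkhorn(p, q, cost, eps=0.05, n_iter=200):
    """Entropic OT plan between histograms p, q for a given cost matrix."""
    K = np.exp(-(cost - cost.min()) / eps)  # shift for numerical stability
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)  # final u-update makes row marginals exact
    return u[:, None] * K * v[None, :]

def entropic_gw(C1, C2, p, q, eps=0.05, n_outer=50):
    """Entropic Gromov-Wasserstein between two dissimilarity matrices
    (square loss), iterating Sinkhorn on the linearized GW gradient."""
    T = np.outer(p, q)  # initialize with the independent coupling
    # constant part of the square-loss gradient: (C1^2 p)_i + (C2^2 q)_j
    const = (C1**2 @ p)[:, None] + (C2**2 @ q)[None, :]
    for _ in range(n_outer):
        grad = const - 2.0 * C1 @ T @ C2.T
        T = sinkhorn(p, q, grad, eps)
    cost = float(np.sum((const - 2.0 * C1 @ T @ C2.T) * T))
    return T, cost
```

Applied to human- and model-derived emotion dissimilarity matrices over the same set of videos, the transport plan `T` indicates which items (or clusters of items) the two structures match to each other, and the returned cost summarizes their overall structural discrepancy.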
Haruka Asanuma
The University of Tokyo, Graduate School of Arts and Sciences, Tokyo, 153-8902, Japan
Naoko Koide-Majima
Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology, Osaka, 565-0871, Japan; The University of Osaka, Graduate School of Frontier Biosciences, Osaka, 565-0871, Japan
Ken Nakamura
The University of Tokyo
Takato Horii
The University of Osaka, Japan
Robotics
Shinji Nishimoto
Osaka University Graduate School of Frontier Biosciences
Neuroscience · Neurophysiology · Natural vision · Brain decoding
Masafumi Oizumi
The University of Tokyo, Graduate School of Arts and Sciences, Tokyo, 153-8902, Japan