Multimodal Large Language Models for End-to-End Affective Computing: Benchmarking and Boosting with Generative Knowledge Prompting

📅 2025-08-04
🤖 AI Summary
Multimodal affective computing (MAC) suffers from unstable performance across tasks and from a limited understanding of how model architectures and data characteristics jointly shape results. To address this, we propose a hybrid optimization framework integrating generative knowledge prompting, cross-modal alignment, and supervised fine-tuning, accompanied by a systematic benchmark for comprehensive evaluation of state-of-the-art open-source multimodal large language models (MLLMs) on audio-visual-text fusion-based emotion recognition. Extensive experiments across multiple standard benchmarks demonstrate significant improvements in end-to-end emotion analysis accuracy and robustness. This work provides the first empirical characterization of the synergistic interplay between architectural design choices and data properties in multimodal emotion understanding, establishing an interpretable and reproducible paradigm for MAC model development. The implementation is publicly available.

📝 Abstract
Multimodal Affective Computing (MAC) aims to recognize and interpret human emotions by integrating information from diverse modalities such as text, video, and audio. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly reshaped the landscape of MAC by offering a unified framework for processing and aligning cross-modal information. However, practical challenges remain, including performance variability across complex MAC tasks and insufficient understanding of how architectural designs and data characteristics impact affective analysis. To address these gaps, we conduct a systematic benchmark evaluation of state-of-the-art open-source MLLMs capable of concurrently processing audio, visual, and textual modalities across multiple established MAC datasets. Our evaluation not only compares the performance of these MLLMs but also provides actionable insights into model optimization by analyzing the influence of model architectures and dataset properties. Furthermore, we propose a novel hybrid strategy that combines generative knowledge prompting with supervised fine-tuning to enhance MLLMs' affective computing capabilities. Experimental results demonstrate that this integrated approach significantly improves performance across various MAC tasks, offering a promising avenue for future research and development in this field. Our code is released on https://github.com/LuoMSen/MLLM-MAC.
Problem

Research questions and friction points this paper is trying to address.

Evaluate MLLMs' performance on multimodal affective computing tasks
Analyze impact of model architectures on affective analysis
Enhance MLLMs' emotion recognition via generative knowledge prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking MLLMs across multiple MAC datasets
Combining generative prompting with fine-tuning
Analyzing model architectures and dataset impacts
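To make the hybrid strategy concrete, here is a minimal sketch of a two-stage generative knowledge prompting flow: the MLLM is first asked to generate auxiliary affective knowledge (e.g., descriptions of facial and vocal cues), which is then fed back as context for the final emotion label. The function names, prompt wording, and label set are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical two-stage generative knowledge prompting sketch.
# Stage 1: elicit emotion-relevant cues; stage 2: classify with those cues.
# All prompt text and identifiers here are illustrative assumptions.

EMOTIONS = ["happy", "sad", "angry", "neutral", "surprised", "fearful", "disgusted"]

def knowledge_prompt(transcript: str) -> str:
    """Prompt asking the model to describe emotion-relevant cues first."""
    return (
        "Given the video, audio, and this transcript:\n"
        f'"{transcript}"\n'
        "Describe the speaker's facial expression, vocal tone, and "
        "any emotionally salient wording."
    )

def classification_prompt(transcript: str, knowledge: str) -> str:
    """Prompt combining the generated knowledge with the recognition task."""
    return (
        f'Transcript: "{transcript}"\n'
        f"Observed cues: {knowledge}\n"
        f"Based on all modalities, answer with exactly one label from {EMOTIONS}."
    )

def recognize_emotion(transcript: str, mllm_call) -> str:
    """Wire the two stages around any chat-style MLLM client callable."""
    knowledge = mllm_call(knowledge_prompt(transcript))
    return mllm_call(classification_prompt(transcript, knowledge))
```

In practice `mllm_call` would wrap an open-source audio-visual-text MLLM; the same two-stage structure also yields (prompt, knowledge, label) triples that can serve as supervised fine-tuning data, which is the combination the paper's hybrid strategy exploits.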
Miaosen Luo
School of Computer Science, South China Normal University
Jiesen Long
School of Computer Science, South China Normal University
Zequn Li
School of Computer Science, South China Normal University
Yunying Yang
School of Information Technology in Education, South China Normal University
Yuncheng Jiang
West China Hospital, Sichuan University
Computer Vision · Medical Image Analysis
Sijie Mai
School of Computer Science, South China Normal University