Reinforcing Trustworthiness in Multimodal Emotional Support Systems

📅 2025-11-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Contemporary multimodal affective support systems suffer from insufficient modality utilization (e.g., discarding raw audiovisual data in favor of text-only inputs), coarse-grained emotion recognition, and responses lacking clinical consistency and credibility. To address these limitations, we propose MultiMood—a novel framework that, for the first time, integrates psychological assessment principles into both multimodal affect understanding and response generation. MultiMood jointly encodes raw video, audio, and text features, models fine-grained emotional components via multimodal embedding, and employs reinforcement learning to optimize a large language model for therapeutic compliance. Evaluated on MESC and DFEW benchmarks, MultiMood achieves state-of-the-art performance. Dual-path evaluation—combining human expert judgment and LLM-based assessment—demonstrates significant improvements over baselines in empathic accuracy, clinical consistency, and response credibility.
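The summary above describes optimizing an LLM via reinforcement learning for "therapeutic compliance". One plausible shape for such a setup is a scalar reward that aggregates per-criterion scores; the sketch below illustrates that idea only. The criterion names and weights are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: a scalar RL reward built from weighted
# psychological criteria (names/weights are illustrative assumptions).
def compliance_reward(scores, weights):
    """Weighted average of per-criterion scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total

scores  = {"empathy": 0.8, "clinical_consistency": 0.6, "credibility": 0.9}
weights = {"empathy": 1.0, "clinical_consistency": 2.0, "credibility": 1.0}
print(round(compliance_reward(scores, weights), 3))  # 0.725
```

A reward of this form could then drive any standard policy-optimization loop over candidate responses; the actual criteria and optimization algorithm used by MultiMood are specified in the paper itself.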

📝 Abstract
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations: they often rely solely on text or convert other data types into text, or provide emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art performance on the MESC and DFEW datasets, while its RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating the framework's superior capability in this domain.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in multimodal emotional support systems
Improving trustworthiness through psychological criteria and reinforcement learning
Enhancing emotion recognition and response generation with multimodal inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multimodal embeddings from video, audio, and text
Incorporates psychological criteria with reinforcement learning
Optimizes large language models for therapeutic standards
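The first bullet above centers on fusing video, audio, and text embeddings. As a minimal sketch of one common fusion pattern (project each modality into a shared space, then concatenate), the snippet below uses illustrative dimensions and random projections; it is not the paper's actual architecture.

```python
import numpy as np

# Minimal late-fusion sketch (dimensions and projections are illustrative,
# not MultiMood's actual design): map each modality's embedding into a
# shared space, then concatenate into one joint representation.
rng = np.random.default_rng(0)
d_shared = 64

video = rng.standard_normal(512)   # e.g., frame-level visual features
audio = rng.standard_normal(128)   # e.g., prosodic/acoustic features
text  = rng.standard_normal(768)   # e.g., sentence-encoder output

w_v = rng.standard_normal((512, d_shared))
w_a = rng.standard_normal((128, d_shared))
w_t = rng.standard_normal((768, d_shared))

fused = np.concatenate([video @ w_v, audio @ w_a, text @ w_t])
print(fused.shape)  # (192,)
```

The fused vector would then feed downstream emotion-component prediction and response generation, as described in the abstract.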
Huy M. Le
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Dat Tien Nguyen
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Ngan T. T. Vo
University of Information Technology
Tuan D. Q. Nguyen
University of Information Technology
Nguyen Binh Le
University of Information Technology
D. M. Nguyen
German Research Center for Artificial Intelligence (DFKI)
Daniel Sonntag
DFKI and University of Oldenburg
Interactive Machine Learning, Intelligent User Interfaces, Multimodal Interaction
Lizi Liao
Singapore Management University
Conversational Agents, Multimedia Analysis, Text Mining
Binh T. Nguyen
VinUniversity
Statistics, Optimal Transport