🤖 AI Summary
Contemporary multimodal affective support systems suffer from insufficient modality utilization (e.g., discarding raw audiovisual data in favor of text-only inputs), coarse-grained emotion recognition, and responses lacking clinical consistency and credibility. To address these limitations, we propose MultiMood—a novel framework that, for the first time, integrates psychological assessment principles into both multimodal affect understanding and response generation. MultiMood jointly encodes raw video, audio, and text features, models fine-grained emotional components via multimodal embeddings, and employs reinforcement learning to optimize a large language model for therapeutic compliance. Evaluated on the MESC and DFEW benchmarks, MultiMood achieves state-of-the-art performance. Dual-path evaluation—combining human expert judgment and LLM-based assessment—demonstrates significant improvements over baselines in empathic accuracy, clinical consistency, and response credibility.
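The summary above describes a joint encoding of video, audio, and text into a shared embedding that feeds a fine-grained emotion-component predictor. The following is a minimal sketch of that fusion idea, assuming pre-extracted per-modality feature vectors; the encoder choices, dimensions, and number of emotion components are illustrative assumptions, not MultiMood's actual architecture.

```python
# Hedged sketch: late fusion of per-modality features into a joint embedding
# that predicts fine-grained emotion components. All sizes are placeholders.
import torch
import torch.nn as nn

class MultimodalEmotionHead(nn.Module):
    def __init__(self, video_dim=512, audio_dim=256, text_dim=768,
                 hidden_dim=256, num_components=7):
        super().__init__()
        # One small projection per modality before fusion.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Joint embedding over the concatenated modalities.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Fine-grained emotion-component classifier.
        self.classifier = nn.Linear(hidden_dim, num_components)

    def forward(self, video_feat, audio_feat, text_feat):
        fused = torch.cat([
            self.video_proj(video_feat),
            self.audio_proj(audio_feat),
            self.text_proj(text_feat),
        ], dim=-1)
        joint = self.fusion(fused)        # shared multimodal embedding
        logits = self.classifier(joint)   # per-component emotion logits
        return joint, logits

# Toy usage with random tensors standing in for real encoder outputs.
model = MultimodalEmotionHead()
joint, logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(joint.shape, logits.shape)  # torch.Size([4, 256]) torch.Size([4, 7])
```

In this kind of design, the joint embedding can be passed to the response generator while the component logits supply the fine-grained emotion signal; how MultiMood actually wires these together is detailed in the paper, not here.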
📝 Abstract
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations: they often rely solely on text, convert other modalities into text, or stop at emotion recognition, thereby overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art performance on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability as a multimodal framework in this domain.
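Point (ii) above amounts to scoring generated responses against psychological criteria and using that score as an RL training signal. The sketch below illustrates only this reward-shaping idea; the criteria names, weights, scoring heuristics, and reward aggregation are hypothetical placeholders, since the abstract does not specify MultiMood's actual criteria or RL algorithm.

```python
# Hedged sketch: aggregate scores from placeholder psychological criteria
# into a scalar reward that an RL fine-tuning loop could maximize.
from typing import Callable, Dict

# Hypothetical per-criterion scorers mapping (context, response) to [0, 1].
CriterionScorer = Callable[[str, str], float]

def empathy_score(context: str, response: str) -> float:
    # Placeholder heuristic; a real system would use a trained judge model.
    return 1.0 if any(w in response.lower() for w in ("understand", "hear you")) else 0.0

def consistency_score(context: str, response: str) -> float:
    # Placeholder: penalize empty or very short replies.
    return min(len(response.split()) / 20.0, 1.0)

CRITERIA: Dict[str, CriterionScorer] = {
    "empathy": empathy_score,
    "clinical_consistency": consistency_score,
}
WEIGHTS = {"empathy": 0.6, "clinical_consistency": 0.4}

def therapeutic_reward(context: str, response: str) -> float:
    """Weighted sum of criterion scores, used as the scalar RL reward."""
    return sum(WEIGHTS[name] * scorer(context, response)
               for name, scorer in CRITERIA.items())

print(therapeutic_reward("I feel overwhelmed.", "I hear you, that sounds really hard."))
```

A scalar reward of this form could plug into a standard policy-optimization loop over the LLM's generated responses; the specific optimization method used by MultiMood is described in the paper itself.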