Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pre-trained language models (PLMs) struggle to effectively integrate audiovisual cues and external knowledge in multimodal dialogue emotion recognition. Method: We propose Lantern, a novel framework featuring domain-aware receptive-field slicing and attention-driven large language model (LLM) prompting, enabling dynamic collaborative calibration between LLMs (e.g., GPT-4, Llama-3.1-405B) and multimodal foundation models (e.g., CORECT, SDT). Lantern employs domain-aware attention-weighted fusion of multimodal features to guide LLMs in refining emotion classification probabilities. Contribution/Results: On IEMOCAP, Lantern achieves absolute accuracy improvements of 1.23% (4-class) and 1.80% (6-class) over state-of-the-art methods. It establishes an interpretable, scalable paradigm for enhancing multimodal emotion understanding via LLMs, bridging modality-specific representations with high-level semantic reasoning.

📝 Abstract
Understanding the emotions in a dialogue usually requires external knowledge to interpret the content accurately. As LLMs become increasingly powerful, we do not want to settle for the limited ability of a pre-trained language model. However, LLMs either can only process the text modality or are too expensive to process multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present Lantern, a framework that improves the performance of a vanilla model by prompting large language models with receptive-field-aware attention weighting. The framework trains a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references, so the LLMs can adjust the predicted probability of each emotion class using their external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of the LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, the vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. Experiments on IEMOCAP with 4-way and 6-way settings demonstrate that Lantern significantly improves the performance of current vanilla models by up to 1.23% and 1.80%, respectively.
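The slicing scheme in the abstract (each sample falls into exactly t receptive fields) can be sketched as overlapping sliding windows. The window width `w` and stride `s` below are hypothetical parameters, not values from the paper; when `s` divides `w`, every utterance lands in exactly `t = w // s` fields, with boundary fields simply being shorter:

```python
def slice_receptive_fields(utterances, w, s):
    """Slice a dialogue into overlapping receptive fields.

    Windows of width `w` slide with stride `s`; starts before 0 and near the
    end are clipped, so each utterance index appears in exactly w // s fields.
    `w` and `s` are illustrative knobs, not values from the Lantern paper.
    """
    assert w % s == 0, "stride must divide window width for uniform coverage"
    n = len(utterances)
    fields = []
    for start in range(-(w - s), n, s):
        # Clip the window to valid indices; boundary fields are shorter.
        field = [(i, utterances[i]) for i in range(max(start, 0), min(start + w, n))]
        fields.append(field)
    return fields
```

For example, with `w=4` and `s=2` every utterance is covered by exactly `t = 2` fields, so each sample gets two independent LLM predictions to be merged later.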
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-modal emotion recognition using LLMs with attention weighting
Overcoming limitations of text-only LLMs by integrating multimedia features
Improving emotion prediction accuracy through receptive-field-aware fusion methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompting LLMs with receptive-field-aware attention weighting
Slicing dialogue into receptive fields for processing
Merging LLM predictions using attention-driven weighting module
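The merging step above can be illustrated with a minimal sketch: per-field class-probability predictions for one sample are combined by a softmax-weighted sum. The attention scores here are hypothetical inputs; in Lantern they come from a learned receptive-field-aware weighting module, which this sketch does not reproduce:

```python
import math

def merge_field_predictions(field_probs, scores):
    """Merge per-receptive-field probability vectors for one sample.

    field_probs: list of probability vectors (one per field containing the sample).
    scores: one attention score per field (hypothetical; learned in the paper).
    Returns a softmax(scores)-weighted average, which stays a valid distribution.
    """
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    num_classes = len(field_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, field_probs))
        for c in range(num_classes)
    ]
```

With equal scores this reduces to a plain average of the per-field predictions; unequal scores let fields with more relevant context dominate the final emotion distribution.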