🤖 AI Summary
This study addresses the limited accuracy of multimodal emotion recognition in conversations caused by insufficient collaboration between the audio and text modalities. We propose an LLM-driven hierarchical multimodal modeling framework. First, transcripts produced by a pretrained ASR system are pseudo-labeled by a large language model, yielding high-quality supervision for pretraining a text-based emotion classifier. The resulting utterance-level text embeddings are then fused with pretrained speech embeddings in a hierarchical Transformer, producing a dialogue-structure-aware audio–text joint model. Key contributions include: (1) an LLM-driven unsupervised pseudo-labeling paradigm for speech transcripts; and (2) a dialogue-level hierarchical audio–text co-training framework. Our method achieves state-of-the-art performance on IEMOCAP and MELD and outperforms baselines on CMU-MOSI, demonstrating strong cross-dataset generalization and effective modality complementarity.
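As a rough sketch of the pseudo-labeling step summarized above, the snippet below shows how ASR transcripts might be labeled by querying a text LLM. The label set, prompt wording, and the `query_llm` callable are illustrative assumptions; the paper's exact prompt and interface are not specified here.

```python
# Illustrative sketch of LLM-driven pseudo-labeling of ASR transcripts.
# The label set, prompt, and query_llm() interface are assumptions made
# for this example, not the authors' exact setup.

LABELS = ["angry", "happy", "sad", "neutral"]  # assumed emotion inventory

def build_prompt(utterance: str) -> str:
    # Ask the LLM to map a single transcript to one emotion label.
    return (
        "Classify the emotion of the following utterance as one of "
        + ", ".join(LABELS) + ".\n"
        f'Utterance: "{utterance}"\n'
        "Emotion:"
    )

def pseudo_label(transcripts, query_llm):
    """Return (transcript, label) pairs for pretraining the text classifier.

    `transcripts` are ASR outputs for unlabeled speech; `query_llm` is any
    callable that sends a prompt to a text LLM and returns its completion.
    """
    labeled = []
    for utt in transcripts:
        reply = query_llm(build_prompt(utt)).strip().lower()
        # Keep only utterances whose reply maps cleanly to a known label.
        label = next((lab for lab in LABELS if lab in reply), None)
        if label is not None:
            labeled.append((utt, label))
    return labeled
```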
📝 Abstract
Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pretrain a text-based recognition model on unsupervised speech transcripts with LLM guidance. The transcripts are obtained from a raw speech dataset using a pre-trained ASR system. A text LLM is queried to provide pseudo-labels for these transcripts, which are then used to learn an utterance-level text-based emotion recognition model. For emotion recognition in conversations, we combine the resulting utterance-level text embeddings with speech embeddings obtained from a recently proposed pre-trained model. We further propose a hierarchical training scheme for the speech-text model that reflects the conversational structure of the data. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves over other benchmarks, achieving state-of-the-art results on two of the three datasets.
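To make the hierarchical audio-text modeling concrete, below is a minimal PyTorch sketch assuming precomputed utterance-level text and speech embeddings (e.g., from the pretrained text classifier and a pre-trained speech encoder). The embedding dimensions, fusion by concatenation, and Transformer depth are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalSpeechTextERC(nn.Module):
    """Sketch of a dialogue-level hierarchical audio-text ERC model.

    Utterance level: text and speech embeddings are fused per utterance.
    Dialogue level: a Transformer encoder contextualizes the fused
    utterance vectors across the whole conversation before classification.
    Dimensions and fusion-by-concatenation are illustrative assumptions.
    """

    def __init__(self, text_dim=768, speech_dim=768, hidden_dim=256,
                 num_classes=4, num_layers=2, num_heads=4):
        super().__init__()
        # Utterance level: project the concatenated modalities to a shared space.
        self.fuse = nn.Linear(text_dim + speech_dim, hidden_dim)
        # Dialogue level: attend across utterances within a conversation.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.dialogue_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_emb, speech_emb, padding_mask=None):
        # text_emb, speech_emb: (batch, num_utterances, dim)
        utt = torch.tanh(self.fuse(torch.cat([text_emb, speech_emb], dim=-1)))
        ctx = self.dialogue_encoder(utt, src_key_padding_mask=padding_mask)
        return self.classifier(ctx)  # per-utterance emotion logits

# Toy usage: a batch of 2 conversations with 10 utterances each.
model = HierarchicalSpeechTextERC()
logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 768))  # (2, 10, 4)
```

Treating each conversation as a sequence of fused utterance vectors lets the dialogue-level encoder exploit conversational context before classifying every turn, which is the intuition behind the hierarchical training described in the abstract.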