LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited accuracy of multimodal emotion recognition in conversations caused by insufficient audio–text modality collaboration, and proposes an LLM-driven hierarchical multimodal modeling framework. First, ASR-derived transcripts are pseudo-labeled by a large language model, enabling robust pretraining of a text-based emotion classifier. Speech embeddings are then fused with the resulting text embeddings in a hierarchical Transformer, yielding a dialogue-structure-aware audio–text joint model. Key contributions: (1) an LLM-driven unsupervised pseudo-labeling paradigm for speech transcripts; and (2) a dialogue-level hierarchical audio–text co-training framework. The method achieves state-of-the-art performance on IEMOCAP and MELD and outperforms strong baselines on CMU-MOSI, demonstrating cross-dataset generalization and effective modality complementarity.
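As a rough illustration of the pseudo-labeling step, the sketch below queries an LLM for an utterance-level emotion label on each ASR transcript and keeps only well-formed answers. The label set, prompt wording, and the query_llm helper are illustrative assumptions, not the authors' exact setup.

```python
from dataclasses import dataclass

# Assumed label set; the paper does not list its exact pseudo-label taxonomy.
EMOTIONS = ["angry", "happy", "sad", "neutral"]

PROMPT = (
    "Classify the emotion of the following utterance as one of "
    f"{', '.join(EMOTIONS)}. Reply with a single word.\n\n"
    "Utterance: {utterance}"
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned LLM
    (an API client or a local model). Returns the raw completion."""
    raise NotImplementedError("plug in your LLM client here")

@dataclass
class PseudoLabeled:
    transcript: str
    label: str

def pseudo_label(transcripts: list[str]) -> list[PseudoLabeled]:
    """Turn unlabeled ASR transcripts into (text, emotion) training pairs."""
    out = []
    for t in transcripts:
        reply = query_llm(PROMPT.format(utterance=t)).strip().lower()
        if reply in EMOTIONS:  # discard malformed or off-vocabulary answers
            out.append(PseudoLabeled(t, reply))
    return out
```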

📝 Abstract
Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. These transcripts are obtained from a raw speech dataset with a pre-trained ASR system. A text LLM is queried to provide pseudo-labels for these transcripts, and the pseudo-labeled transcripts are subsequently used to learn an utterance-level text-based emotion recognition model. We use the utterance-level text embeddings for emotion recognition in conversations, along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the dataset. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, where we illustrate that the proposed model improves over other benchmarks and achieves state-of-the-art results on two out of these three datasets.
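A minimal sketch of the hierarchical idea, assuming PyTorch and illustrative dimensions (not the paper's exact architecture): each utterance's text and speech embeddings are projected and fused, and a second Transformer attends across the utterance sequence so every prediction is conditioned on the dialogue context.

```python
import torch
import torch.nn as nn

class HierarchicalSpeechTextERC(nn.Module):
    """Two-level model: fuse per-utterance speech/text embeddings, then
    contextualize them across the conversation with a second Transformer.
    Dimensions and layer counts are illustrative, not the paper's."""

    def __init__(self, text_dim=768, speech_dim=1024, d_model=256, n_classes=4):
        super().__init__()
        # Utterance level: project each modality and fuse by concatenation.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Conversation level: self-attention over the utterance sequence
        # makes each prediction dialogue-aware.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, speech_emb):
        # text_emb: (batch, n_utts, text_dim); speech_emb: (batch, n_utts, speech_dim)
        u = torch.cat([self.text_proj(text_emb), self.speech_proj(speech_emb)], dim=-1)
        u = torch.relu(self.fuse(u))
        u = self.context(u)   # attend across the dialogue
        return self.head(u)   # per-utterance emotion logits

# Smoke test with dummy embeddings: 2 dialogues of 10 utterances each.
logits = HierarchicalSpeechTextERC()(torch.randn(2, 10, 768), torch.randn(2, 10, 1024))
print(logits.shape)  # torch.Size([2, 10, 4])
```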
Problem

Research questions and friction points this paper is trying to address.

Multimodal emotion recognition in conversations is hard because emotions are expressed jointly through speech and text
How to pretrain text-based emotion models with LLM-generated pseudo-labels derived from unlabeled speech
How to integrate text and speech embeddings, respecting dialogue structure, for improved emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-supervised pseudo-labeling of ASR transcripts (sketched after the AI summary above)
Hierarchical speech-text model training (see the two-stage sketch after this list)
Unsupervised ASR transcripts labeled with LLM guidance
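The two-stage training schedule implied by the paper could look like the following sketch: stage 1 fits a text classifier on LLM pseudo-labels, stage 2 freezes it and trains only the dialogue-level audio-text model on labeled ERC data. All modules, dimensions, and tensors here are dummy stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the real components: in the paper these would be a pretrained
# text classifier and an audio-text dialogue model; dimensions are illustrative.
text_clf = nn.Linear(768, 4)
dialogue_clf = nn.Linear(768 + 1024, 4)
ce = nn.CrossEntropyLoss()

# --- Stage 1: pretrain the text classifier on LLM pseudo-labeled transcripts.
opt = torch.optim.Adam(text_clf.parameters(), lr=1e-3)
text_emb, pseudo_y = torch.randn(32, 768), torch.randint(0, 4, (32,))
for _ in range(5):
    opt.zero_grad()
    ce(text_clf(text_emb), pseudo_y).backward()
    opt.step()

# --- Stage 2: freeze the text side; train only the dialogue-level model on
# labeled ERC data. text_emb stands in for embeddings from the frozen text model.
for p in text_clf.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(dialogue_clf.parameters(), lr=1e-3)
speech_emb, y = torch.randn(32, 1024), torch.randint(0, 4, (32,))
for _ in range(5):
    opt.zero_grad()
    ce(dialogue_clf(torch.cat([text_emb, speech_emb], dim=-1)), y).backward()
    opt.step()
```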
Soumya Dutta
Assistant Professor of Computer Science at IIT Kanpur
Machine Learning · Visual Computing · xAI · Data Science · HPC
Sriram Ganapathy
LEAP Lab, Electrical Engineering, Indian Institute of Science Bangalore, Bangalore, India