🤖 AI Summary
This study addresses the limited accuracy of multimodal emotion recognition in conversations caused by insufficient collaboration between the audio and text modalities. We propose an LLM-driven hierarchical multimodal modeling framework. First, transcripts produced by a pretrained ASR system are pseudo-labeled by a large language model, yielding high-quality supervision for pretraining a text-based emotion classifier. The resulting utterance-level text embeddings are then fused with pretrained speech embeddings in a hierarchical Transformer, producing a dialogue-structure-aware audio–text joint model. Key contributions include: (1) an LLM-driven unsupervised pseudo-labeling paradigm for speech transcripts; and (2) a dialogue-level hierarchical audio–text co-training framework. Our method achieves state-of-the-art performance on IEMOCAP and MELD and outperforms baselines on CMU-MOSI, demonstrating strong cross-dataset generalization and effective modality complementarity.
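As a rough sketch of the pseudo-labeling step summarized above, the snippet below shows how ASR transcripts might be labeled by querying a text LLM. The label set, prompt wording, and the `query_llm` callable are illustrative assumptions; the paper's exact prompt and interface are not specified here.

```python
# Illustrative sketch of LLM-driven pseudo-labeling of ASR transcripts.
# The label set, prompt, and query_llm() interface are assumptions made
# for this example, not the authors' exact setup.

LABELS = ["angry", "happy", "sad", "neutral"]  # assumed emotion inventory

def build_prompt(utterance: str) -> str:
    # Ask the LLM to map a single transcript to one emotion label.
    return (
        "Classify the emotion of the following utterance as one of "
        + ", ".join(LABELS) + ".\n"
        f'Utterance: "{utterance}"\n'
        "Emotion:"
    )

def pseudo_label(transcripts, query_llm):
    """Return (transcript, label) pairs for pretraining the text classifier.

    `transcripts` are ASR outputs for unlabeled speech; `query_llm` is any
    callable that sends a prompt to a text LLM and returns its completion.
    """
    labeled = []
    for utt in transcripts:
        reply = query_llm(build_prompt(utt)).strip().lower()
        # Keep only utterances whose reply maps cleanly to a known label.
        label = next((lab for lab in LABELS if lab in reply), None)
        if label is not None:
            labeled.append((utt, label))
    return labeled
```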
📝 Abstract
Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pretrain a text-based recognition model on unsupervised speech transcripts with LLM guidance. The transcripts are obtained from a raw speech dataset using a pre-trained ASR system. A text LLM is queried to provide pseudo-labels for these transcripts, which are then used to learn an utterance-level text-based emotion recognition model. For emotion recognition in conversations, we combine the resulting utterance-level text embeddings with speech embeddings obtained from a recently proposed pre-trained model. We further propose a hierarchical training scheme for the speech-text model that reflects the conversational structure of the data. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves over other benchmarks, achieving state-of-the-art results on two of the three datasets.
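To make the hierarchical audio-text modeling concrete, below is a minimal PyTorch sketch assuming precomputed utterance-level text and speech embeddings (e.g., from the pretrained text classifier and a pre-trained speech encoder). The embedding dimensions, fusion by concatenation, and Transformer depth are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalSpeechTextERC(nn.Module):
    """Sketch of a dialogue-level hierarchical audio-text ERC model.

    Utterance level: text and speech embeddings are fused per utterance.
    Dialogue level: a Transformer encoder contextualizes the fused
    utterance vectors across the whole conversation before classification.
    Dimensions and fusion-by-concatenation are illustrative assumptions.
    """

    def __init__(self, text_dim=768, speech_dim=768, hidden_dim=256,
                 num_classes=4, num_layers=2, num_heads=4):
        super().__init__()
        # Utterance level: project the concatenated modalities to a shared space.
        self.fuse = nn.Linear(text_dim + speech_dim, hidden_dim)
        # Dialogue level: attend across utterances within a conversation.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.dialogue_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_emb, speech_emb, padding_mask=None):
        # text_emb, speech_emb: (batch, num_utterances, dim)
        utt = torch.tanh(self.fuse(torch.cat([text_emb, speech_emb], dim=-1)))
        ctx = self.dialogue_encoder(utt, src_key_padding_mask=padding_mask)
        return self.classifier(ctx)  # per-utterance emotion logits

# Toy usage: a batch of 2 conversations with 10 utterances each.
model = HierarchicalSpeechTextERC()
logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 768))  # (2, 10, 4)
```

Treating each conversation as a sequence of fused utterance vectors lets the dialogue-level encoder exploit conversational context before classifying every turn, which is the intuition behind the hierarchical training described in the abstract.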