🤖 AI Summary
Speech emotion recognition (SER) suffers from high annotation costs, subjective labeling bias, and a lack of contextual information. To address these challenges, this paper proposes a fully automated, text-prompt-driven multimodal emotion labeling paradigm that eliminates manual intervention and cross-modal alignment. Leveraging GPT-4o, the authors perform end-to-end emotion annotation on *Friends* video clips for the first time. Through structured prompt engineering, combined with self-supervised speech representation models (Wav2Vec 2.0, HuBERT, WavLM, data2vec), they construct MELT, the first high-quality multimodal emotion dataset whose labels are generated entirely by a large language model. Experiments show that MELT significantly improves performance across multiple SER benchmarks; human evaluation confirms high inter-annotator consistency and semantic plausibility; and downstream task validation further demonstrates improved generalization. This work pioneers an LLM-driven, unsupervised paradigm for multimodal emotion labeling.
📝 Abstract
Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also prone to inconsistency: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to annotate emotional speech data without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as input. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o accumulated during its training, showing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. We therefore propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, the results of our subjective experiments demonstrate a consistent performance improvement in SER.
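To make the text-prompt-driven annotation concrete, the sketch below shows one plausible way to assemble a structured prompt for a single *Friends* utterance. The field layout, the instruction wording, and the seven-label emotion set (MELD-style) are all assumptions for illustration; the paper's actual prompt may differ.

```python
# Hypothetical structured prompt builder for GPT-4o emotion annotation.
# The label set and prompt fields are assumptions, not the paper's exact design.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def build_annotation_prompt(utterance, speaker, context, labels=EMOTIONS):
    """Assemble a structured text prompt from dialogue context and a target line."""
    context_block = "\n".join(f"{s}: {u}" for s, u in context)
    return (
        "You are an emotion annotator for TV dialogue.\n"
        f"Allowed labels: {', '.join(labels)}.\n\n"
        "Conversation so far:\n"
        f"{context_block}\n\n"
        f'Target utterance ({speaker}): "{utterance}"\n'
        "Answer with exactly one label from the list."
    )

prompt = build_annotation_prompt(
    "We were on a break!", "Ross",
    context=[("Rachel", "You said you wanted to take a break."),
             ("Ross", "I did not mean it like that!")],
)
print(prompt)
```

The returned string would then be sent to GPT-4o via a standard chat-completion call; because the model sees the surrounding dialogue, it can exploit the contextual knowledge that individual annotators may lack.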
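The downstream evaluation fine-tunes SSL backbones on MELT labels; a minimal sketch of that setup is below. To stay self-contained, the frozen backbone is replaced by random 768-dimensional features (the hidden size of Wav2Vec 2.0 Base); in practice these would be pooled outputs of the pretrained model, and the seven-class label set is an assumption.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 7   # assumed seven MELD-style emotion labels
HIDDEN = 768       # Wav2Vec 2.0 Base hidden size

torch.manual_seed(0)
features = torch.randn(32, HIDDEN)              # stand-in for SSL embeddings
labels = torch.randint(0, NUM_EMOTIONS, (32,))  # stand-in for MELT labels

head = nn.Linear(HIDDEN, NUM_EMOTIONS)          # classification head on top of the backbone
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

initial_loss = loss_fn(head(features), labels).item()
for _ in range(50):                             # a few gradient steps on the head
    opt.zero_grad()
    loss = loss_fn(head(features), labels)
    loss.backward()
    opt.step()
final_loss = loss.item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Swapping the random features for real backbone outputs (and optionally unfreezing the backbone) turns this linear probe into the full fine-tuning recipe the abstract describes.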