MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech emotion recognition (SER) suffers from high annotation costs, subjective labeling biases, and a lack of contextual information. To address these challenges, this paper proposes a fully automated, text-prompt-driven multimodal emotion labeling paradigm that eliminates manual intervention and cross-modal alignment. Leveraging GPT-4o, the authors perform end-to-end emotion annotation on *Friends* video clips. Through structured prompt engineering, combined with fine-tuning of self-supervised speech representation models (Wav2Vec 2.0, HuBERT, WavLM, data2vec), they construct MELT, presented as the first high-quality multimodal emotion dataset generated entirely by a large language model. Experiments show that MELT improves performance across multiple SER benchmarks; human evaluation confirms high inter-annotator consistency and semantic plausibility; and downstream task validation further verifies its generalization capability. This work pioneers an LLM-driven, unsupervised paradigm for multimodal emotion labeling.

📝 Abstract
Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, the results of our subjective experiments demonstrate a consistent performance improvement on SER.
Problem

Research questions and friction points this paper is trying to address.

Automating multimodal emotion data annotation using LLMs
Reducing cost and inconsistency in human emotion labeling
Exploring LLMs' potential for unsupervised speech emotion annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging GPT-4o for automated emotion annotation
Using structured text prompts for multimodal data
Creating fully LLM-annotated dataset MELT
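To make the annotation idea concrete, below is a minimal sketch of how a structured text prompt for LLM-based emotion labeling might be assembled. The label set, prompt wording, and helper names are illustrative assumptions for this summary, not the paper's actual prompt design; the paper feeds such textual cues (dialogue context, speaker, target utterance) to GPT-4o without audio or video input.

```python
# Hypothetical sketch of text-prompt-driven emotion annotation in the spirit
# of MELT. The emotion label set and prompt template are assumptions, not
# the paper's exact design.

EMOTIONS = ["neutral", "joy", "sadness", "anger", "fear", "surprise", "disgust"]

def build_annotation_prompt(speaker: str, utterance: str, context: list[str]) -> str:
    """Assemble a structured text prompt asking an LLM to pick one emotion
    label for a target utterance, given only textual dialogue context."""
    history = "\n".join(context)
    return (
        "You are annotating emotions in sitcom dialogue.\n"
        f"Preceding dialogue:\n{history}\n\n"
        f'Target utterance by {speaker}: "{utterance}"\n'
        f"Respond with exactly one label from: {', '.join(EMOTIONS)}."
    )

# The resulting prompt would then be sent to the LLM, e.g. with the OpenAI
# Python SDK (API key required, so the call is shown but not executed here):
#
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(
#       model="gpt-4o",
#       messages=[{"role": "user",
#                  "content": build_annotation_prompt(
#                      "Joey", "How you doin'?", ["Chandler: Hey."])}],
#   )
#   label = response.choices[0].message.content.strip().lower()
```

Constraining the model to a fixed label list keeps the output machine-parseable, which is what makes a fully automated, human-free annotation pipeline feasible.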