Exploring the Feasibility of LLMs for Automated Music Emotion Annotation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high cost and limited scalability of manual music emotion annotation by conducting the first systematic evaluation of GPT-4o's feasibility and reliability for automated music emotion labeling. Leveraging the GiantMIDI-Piano MIDI dataset, we generated model annotations within the four-quadrant valence-arousal framework and performed a multi-dimensional comparison against annotations from three human experts. Evaluation metrics included standard accuracy, expert-consensus-weighted accuracy, Cohen's Kappa, and label distribution similarity. Results indicate that while GPT-4o trails human experts in overall accuracy and offers coarser emotional discrimination, its variability falls within the natural range of inter-expert disagreement, demonstrating acceptable reliability. Crucially, GPT-4o achieves this at substantially lower cost and higher throughput. These findings establish GPT-4o as a practically viable alternative for large-scale music emotion annotation, offering a scalable, efficient, and empirically grounded solution to a longstanding bottleneck in affective music computing.
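To make the evaluation procedure concrete, the sketch below shows one plausible way to compute the reported agreement metrics between GPT-4o labels and three expert annotators. This is not the paper's code: the toy labels, the majority-vote consensus, the per-item weighting used for expert-consensus-weighted accuracy, and the use of Jensen-Shannon distance for label distribution similarity are all assumptions made for illustration.

```python
# Hypothetical sketch of the agreement metrics described above (not the paper's code).
# Assumes each piece receives one quadrant label in {1, 2, 3, 4} from GPT-4o
# and from each of three human experts.
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import accuracy_score, cohen_kappa_score

gpt = np.array([1, 3, 2, 4, 1])                    # toy GPT-4o labels
experts = np.array([[1, 3, 2, 4, 2],               # expert A
                    [1, 3, 1, 4, 1],               # expert B
                    [1, 2, 2, 4, 1]])              # expert C

# Majority vote across experts serves as the reference label for each piece.
consensus = np.array([Counter(col).most_common(1)[0][0] for col in experts.T])

# Standard accuracy and Cohen's Kappa against the expert consensus.
acc = accuracy_score(consensus, gpt)
kappa = cohen_kappa_score(consensus, gpt)

# Expert-consensus-weighted accuracy (assumed form): each piece contributes the
# fraction of experts whose label matches GPT's, so items with full expert
# agreement weigh more than ambiguous ones.
weighted_acc = np.mean([(experts[:, i] == gpt[i]).mean() for i in range(len(gpt))])

# Label distribution similarity: Jensen-Shannon distance between the quadrant
# frequency distributions of GPT labels and expert consensus labels.
quadrants = [1, 2, 3, 4]
p_gpt = np.array([np.mean(gpt == q) for q in quadrants])
p_exp = np.array([np.mean(consensus == q) for q in quadrants])
js_distance = jensenshannon(p_gpt, p_exp)

print(acc, kappa, weighted_acc, js_distance)
```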

📝 Abstract
Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens, severely limiting the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. In this study, we annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared the results against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT's annotation performance fell short of that of human experts in overall accuracy and exhibited less nuance in categorizing specific emotional states, inter-rater reliability metrics indicate that GPT's variability remains within the range of natural disagreement among experts. These findings underscore both the limitations and potential of GPT-based annotation: despite its current shortcomings relative to human performance, its cost-effectiveness and efficiency render it a promising scalable alternative for music emotion annotation.
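For readers unfamiliar with the four-quadrant valence-arousal framework mentioned in the abstract, the sketch below illustrates how a continuous (valence, arousal) pair maps onto the four quadrants. The quadrant numbering, example emotion words, and the [-1, 1] value range are assumed conventions for illustration, not details taken from the paper.

```python
# Illustrative mapping from continuous valence-arousal values to four quadrants
# (numbering and emotion words are assumed conventions, not the paper's scheme).
def va_quadrant(valence: float, arousal: float) -> str:
    """Return the quadrant label for a (valence, arousal) pair,
    where both values lie in [-1, 1] and 0 is the neutral midpoint."""
    if valence >= 0 and arousal >= 0:
        return "Q1: happy/excited"    # positive valence, high arousal
    if valence < 0 and arousal >= 0:
        return "Q2: angry/tense"      # negative valence, high arousal
    if valence < 0 and arousal < 0:
        return "Q3: sad/depressed"    # negative valence, low arousal
    return "Q4: calm/content"         # positive valence, low arousal

print(va_quadrant(0.6, -0.3))  # -> "Q4: calm/content"
```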
Problem

Research questions and friction points this paper is trying to address.

Assessing GPT-4o for automated music emotion annotation
Comparing GPT-4o annotations with human expert labels
Evaluating cost-effectiveness of LLMs for scalable emotion tagging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using GPT-4o for music emotion annotation
Comparing GPT-4o annotations with human experts
Evaluating reliability via inter-annotator metrics