🤖 AI Summary
Music Emotion Recognition (MER) faces a critical challenge: emotion annotations are heterogeneous across datasets, appearing either as categorical labels (e.g., happy/sad) or as dimensional ones (e.g., valence-arousal). This paper proposes the first end-to-end multi-task framework that jointly models both label types within a single model. The method integrates MERT-based audio representations with symbolic musical features (e.g., key and chords) and uses knowledge distillation to transfer expertise from single-dataset teacher models to a student, substantially improving cross-domain generalization. Extensive experiments on four benchmark datasets, including MTG-Jamendo and DEAM, demonstrate consistent state-of-the-art performance. Notably, the approach outperforms the MediaEval 2021 winning model on MTG-Jamendo, validating its effectiveness on heterogeneous emotion annotations and its robustness across diverse domains.
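The fused input representation described above, MERT audio embeddings combined with symbolic features such as key and chords, can be sketched as a simple concatenation. The dimensions, encodings, and the `fuse_features` helper below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def fuse_features(mert_frames, key_id, chord_ids, n_keys=24, n_chords=24):
    """Concatenate pooled MERT embeddings with symbolic musical features.

    mert_frames : (T, D) frame-level MERT embeddings
    key_id      : int, global key class (e.g., 24 major/minor keys)
    chord_ids   : (T,) frame-level chord class ids

    All sizes are hypothetical; the paper does not specify this layout.
    """
    pooled = mert_frames.mean(axis=0)            # (D,) time-averaged audio embedding
    key_vec = np.eye(n_keys)[key_id]             # (n_keys,) one-hot key
    # Normalized histogram of chord classes over time as a fixed-size summary.
    chord_hist = np.bincount(chord_ids, minlength=n_chords) / len(chord_ids)
    return np.concatenate([pooled, key_vec, chord_hist])
```

A pooled-plus-histogram summary is only one way to make the symbolic features fixed-size; frame-level fusion would also be consistent with the description.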
📝 Abstract
One of the most significant challenges in Music Emotion Recognition (MER) is that emotion labels can be heterogeneous across datasets with regard to the emotion representation: some datasets use categorical labels (e.g., happy, sad) while others use dimensional ones (e.g., valence-arousal). In this paper, we present a unified multitask learning framework that combines these two types of labels and can therefore be trained on multiple datasets. The framework uses an effective input representation that combines musical features (i.e., key and chords) with MERT embeddings. Moreover, knowledge distillation is employed to transfer the knowledge of teacher models trained on individual datasets to a student model, enhancing its ability to generalize across multiple tasks. To validate the proposed framework, we conducted extensive experiments on a variety of datasets, including MTG-Jamendo, DEAM, PMEmo, and EmoMusic. Our experimental results show that the inclusion of musical features, multitask learning, and knowledge distillation each significantly improves performance. In particular, our model outperforms state-of-the-art models, including the best-performing model from the MediaEval 2021 competition, on the MTG-Jamendo dataset. Our work contributes to MER by combining categorical and dimensional emotion labels in one unified framework, thereby enabling training across datasets.
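The training objective implied by the abstract, a categorical classification term, a dimensional regression term, and a distillation term toward teacher outputs, can be sketched as a single combined loss. This is a minimal illustration: the function name `unified_loss`, the equal weighting of terms, and the `alpha`/`temperature` values are assumptions, not the paper's reported configuration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def unified_loss(cls_logits, cls_labels, va_pred, va_true,
                 teacher_logits, alpha=0.5, temperature=2.0):
    """Combined multitask objective (illustrative weighting).

    cls_logits     : (N, C) student logits for categorical emotions
    cls_labels     : (N,) integer class labels (e.g., happy=0, sad=1, ...)
    va_pred        : (N, 2) predicted valence-arousal values
    va_true        : (N, 2) annotated valence-arousal values
    teacher_logits : (N, C) logits from a single-dataset teacher model
    """
    # Categorical branch: cross-entropy against hard labels.
    p = softmax(cls_logits)
    ce = -np.mean(np.log(p[np.arange(len(cls_labels)), cls_labels] + 1e-12))

    # Dimensional branch: mean squared error on valence-arousal.
    mse = np.mean((va_pred - va_true) ** 2)

    # Distillation branch: KL divergence to the teacher's softened outputs.
    t = softmax(teacher_logits / temperature)
    s = softmax(cls_logits / temperature)
    kd = np.mean(np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1))

    return ce + mse + alpha * kd
```

In practice each batch would supply only one label type, so the term for the missing annotation would simply be masked out; the scalar weights here are placeholders for tuned hyperparameters.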