🤖 AI Summary
Music emotion recognition (MER) suffers from weak cross-dataset generalization, genre bias, and dataset bias due to subjective annotations and imbalanced genre distributions. Method: We systematically analyze five multi-genre, multi-dimensional emotion datasets, revealing substantial influences of genre and data-source biases on feature representations. We propose a lightweight fusion framework that jointly leverages Jukebox audio embeddings and chroma features, trained via multi-dataset joint optimization to mitigate distributional shift. Results: Cross-dataset evaluation across multiple public MER benchmarks demonstrates significant improvements in out-of-distribution generalization—achieving an average accuracy gain of 4.2% over state-of-the-art methods—while enhancing model robustness and practical applicability. Our core contribution is the first empirical validation of multi-source joint training as critical for improving out-of-distribution generalization in MER, complemented by an interpretable analysis framework for quantifying genre-specific biases.
📝 Abstract
Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres, such as rock and classical, within a single framework. In this paper, we address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations -- EmoMusic, DEAM, PMEmo, WTC, and WCMED -- which span various musical styles. We demonstrate the problem of out-of-distribution generalization in a systematic experiment. By closely looking at multiple data and feature sets, we provide insight into genre-emotion relationships in existing data and examine potential genre dominance and dataset biases in certain feature representations. Based on these experiments, we arrive at a simple yet effective framework that combines embeddings extracted from the Jukebox model with chroma features and demonstrate how, alongside a combination of several diverse training sets, this permits us to train models with substantially improved cross-dataset generalization capabilities.