A Study on the Data Distribution Gap in Music Emotion Recognition

📅 2025-10-06

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

Music emotion recognition (MER) suffers from weak cross-dataset generalization, genre bias, and dataset bias due to subjective annotations and imbalanced genre distributions. Method: We systematically analyze five multi-genre, multi-dimensional emotion datasets, revealing substantial influences of genre and data-source biases on feature representations. We propose a lightweight fusion framework that jointly leverages Jukebox audio embeddings and chroma features, trained via multi-dataset joint optimization to mitigate distributional shift. Results: Cross-dataset evaluation across multiple public MER benchmarks demonstrates significant improvements in out-of-distribution generalization—achieving an average accuracy gain of 4.2% over state-of-the-art methods—while enhancing model robustness and practical applicability. Our core contribution is the first empirical validation of multi-source joint training as critical for improving out-of-distribution generalization in MER, complemented by an interpretable analysis framework for quantifying genre-specific biases.

Technology Category

Application Category

📝 Abstract

Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres, such as rock and classical, within a single framework. In this paper, we address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations -- EmoMusic, DEAM, PMEmo, WTC, and WCMED -- which span various musical styles. We demonstrate the problem of out-of-distribution generalization in a systematic experiment. By closely looking at multiple data and feature sets, we provide insight into genre-emotion relationships in existing data and examine potential genre dominance and dataset biases in certain feature representations. Based on these experiments, we arrive at a simple yet effective framework that combines embeddings extracted from the Jukebox model with chroma features and demonstrate how, alongside a combination of several diverse training sets, this permits us to train models with substantially improved cross-dataset generalization capabilities.

Problem

Research questions and friction points this paper is trying to address.

Investigating out-of-distribution generalization in music emotion recognition

Addressing genre dominance and dataset biases in feature representations

Improving cross-dataset generalization using multimodal embeddings and diverse training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining Jukebox embeddings with chroma features

Using multiple diverse training datasets together

Training models for improved cross-dataset generalization

🔎 Similar Papers

Are we there yet? A brief survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges