MERGE - A Bimodal Dataset for Static Music Emotion Recognition

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the scarcity of large-scale, publicly available audio–lyrics bimodal datasets for Music Emotion Recognition (MER), this paper introduces MERGE—the first open-source, large-scale bimodal dataset supporting static emotion classification. We propose a semi-automated construction pipeline and release three aligned subsets: audio-only, lyrics-only, and audio–lyrics bimodal, along with standardized train/validation/test splits. For modeling, we systematically benchmark conventional handcrafted features with SVM/RF against deep unimodal (CNN, LSTM) and bimodal fusion architectures, including a novel dual-stream deep neural network. Our dual-stream model achieves a 79.21% macro-F1 score on MERGE, significantly outperforming prior approaches and validating the dataset’s utility. MERGE establishes a new benchmark for bimodal MER research and enables reproducible, large-scale evaluation of multimodal emotion modeling techniques.
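The headline number is a macro-averaged F1 score, which weights all emotion classes equally rather than favoring the majority class; this matters in MER, where quadrant classes are rarely balanced. A minimal sketch of the metric in pure Python (the Q1–Q4 labels and sample values are purely illustrative, not data from the paper):

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 score over every class that appears."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if (precision + recall) else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Russell-quadrant-style labels, illustrative only
y_true = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"]
y_pred = ["Q1", "Q2", "Q3", "Q1", "Q1", "Q2"]
print(round(macro_f1(y_true, y_pred), 4))  # → 0.7
```

Note that the never-predicted class Q4 contributes an F1 of 0, dragging the average down; a plain accuracy or micro-F1 would hide that failure mode.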

📝 Abstract
The Music Emotion Recognition (MER) field has seen steady developments in recent years, with contributions from feature engineering, machine learning, and deep learning. The landscape has also shifted from audio-centric systems to bimodal ensembles that combine audio and lyrics. However, a severe lack of public and sizeable bimodal databases has hampered the development and improvement of bimodal audio-lyrics systems. This article proposes three new audio, lyrics, and bimodal MER research datasets, collectively called MERGE, created using a semi-automatic approach. To comprehensively assess the proposed datasets and establish a baseline for benchmarking, we conducted several experiments for each modality, using feature engineering, machine learning, and deep learning methodologies. In addition, we propose and validate fixed train-validate-test splits. The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
Problem

Research questions and friction points this paper is trying to address.

Lack of public bimodal datasets for music emotion recognition.
Need for combining audio and lyrics in emotion recognition systems.
No fixed, reproducible train-validate-test splits for benchmarking bimodal MER systems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic creation of bimodal datasets
Integration of audio and lyrics for emotion recognition
Dual-stream deep neural network achieving a 79.21% macro-F1 score for bimodal classification
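The dual-stream idea can be illustrated with a toy feature-level-fusion forward pass: one branch embeds audio features, a second embeds lyric features, and the concatenated embeddings feed a shared classifier over the four emotion quadrants. Every dimension and layer below is a hypothetical placeholder for illustration, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    # Single fully connected layer with ReLU activation
    return np.maximum(x @ w + b, 0.0)

# Hypothetical sizes: 1582-d handcrafted audio features,
# 300-d lyric embedding, 64-d per-stream hidden layer, 4 quadrants
W_audio = rng.normal(size=(1582, 64)); b_audio = np.zeros(64)
W_lyric = rng.normal(size=(300, 64));  b_lyric = np.zeros(64)
W_out   = rng.normal(size=(128, 4));   b_out   = np.zeros(4)

def dual_stream_forward(audio_feats, lyric_feats):
    """Toy bimodal forward pass: per-modality streams, then fused classifier."""
    h_audio = dense_relu(audio_feats, W_audio, b_audio)  # audio stream
    h_lyric = dense_relu(lyric_feats, W_lyric, b_lyric)  # lyrics stream
    fused = np.concatenate([h_audio, h_lyric], axis=-1)  # feature-level fusion
    logits = fused @ W_out + b_out
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)         # softmax over quadrants

# Two random "songs": each yields a probability distribution over 4 classes
probs = dual_stream_forward(rng.normal(size=(2, 1582)),
                            rng.normal(size=(2, 300)))
```

Fusing at the embedding level, rather than averaging two independent unimodal predictions, lets the classifier learn cross-modal interactions; which fusion point works best is exactly the kind of question the fixed MERGE splits make comparable across studies.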
P. Louro
University of Coimbra, Centre for Informatics and Systems of the University of Coimbra (CISUC), Department of Informatics Engineering, and LASI
Hugo Redinho
University of Coimbra, Centre for Informatics and Systems of the University of Coimbra (CISUC), Department of Informatics Engineering, and LASI
Ricardo Santos
PhD in Biomedical Engineering, Senior Scientist at Fraunhofer Portugal AICOS
Ricardo Malheiro
CISUC, LASI, and Polytechnic Institute of Leiria - School of Technology and Management
R. Panda
CISUC, LASI, and Ci2 - Smart Cities Research Center, Polytechnic Institute of Tomar
R. Paiva
University of Coimbra, Centre for Informatics and Systems of the University of Coimbra (CISUC), Department of Informatics Engineering, and LASI