A Survey on Multimodal Music Emotion Recognition

📅 2025-04-26
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Multimodal Music Emotion Recognition (MMER) faces critical challenges, including scarce labeled data, limited multimodal corpora, and insufficient real-time performance, and existing models remain suboptimal in robustness, scalability, and interpretability. To address these, this work proposes the first structured, unified four-stage framework (data selection, cross-modal feature extraction, feature fusion, and emotion prediction) that systematically integrates heterogeneous modalities (audio, text, visual, and physiological signals) through deep-learning-based joint modelling. Drawing on a comprehensive review of over 100 studies, the survey traces the field's technical evolution, confirms the centrality of deep learning and advanced fusion strategies, and identifies current bottlenecks and future research directions. The framework offers both theoretical rigor and practical feasibility, with applications in adaptive music recommendation, emotion-aware therapeutic systems, and intelligent entertainment.
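To make the four-stage structure concrete, the sketch below shows one minimal way such a pipeline could be wired together in PyTorch. It is illustrative only: the class name, dimensions, two-modality setup, and concatenation-based fusion are assumptions chosen for the example, not an implementation from the paper, which surveys many variants of each stage.

```python
import torch
import torch.nn as nn

class MMERPipeline(nn.Module):
    """Illustrative four-stage MMER pipeline: modality-specific encoders
    (stage 2) feed a fusion layer (stage 3) and an emotion classifier
    (stage 4); stage 1 (data selection) happens outside the model."""

    def __init__(self, audio_dim=128, text_dim=768, hidden=256, n_emotions=4):
        super().__init__()
        # Stage 2: feature extraction, one encoder per modality.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Stage 3: feature fusion (simple concatenation + projection here).
        self.fusion = nn.Linear(2 * hidden, hidden)
        # Stage 4: emotion prediction, e.g. four quadrants of the
        # valence-arousal plane.
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, audio_feats, text_feats):
        a = self.audio_enc(audio_feats)
        t = self.text_enc(text_feats)
        fused = torch.relu(self.fusion(torch.cat([a, t], dim=-1)))
        return self.classifier(fused)

# Stage 1 (data selection) would pair, e.g., spectrogram statistics with
# lyric embeddings for each track before the model is called:
model = MMERPipeline()
logits = model(torch.randn(8, 128), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```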

📝 Abstract
Multimodal music emotion recognition (MMER) is an emerging discipline in music information retrieval that has experienced a surge of interest in recent years. This survey provides a comprehensive overview of the current state of the art in MMER. After discussing the different approaches and techniques used in the field, the paper introduces a four-stage MMER framework comprising multimodal data selection, feature extraction, feature processing, and final emotion prediction. The survey further reveals significant advancements in deep learning methods and the increasing importance of feature fusion techniques. Despite these advancements, challenges remain, including the need for large annotated datasets, datasets with more modalities, and real-time processing capabilities. The paper also contributes to the field by identifying critical gaps in current research and suggesting directions for future work. These gaps underscore the importance of developing robust, scalable, and interpretable models for MMER, with implications for applications in music recommendation systems, therapeutic tools, and entertainment.
Problem

Research questions and friction points this paper is trying to address.

Surveying state-of-the-art multimodal music emotion recognition techniques
Addressing challenges in dataset size and real-time MMER processing
Identifying gaps for robust and interpretable MMER models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Four-stage MMER framework for emotion recognition
Deep learning advancements in feature extraction
Feature fusion techniques for multimodal data (sketched below)
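For the fusion bullet above, the two strategies most commonly contrasted in the MMER literature are feature-level (early) fusion and decision-level (late) fusion. The snippet below sketches both; the function names, dimensions, and the fixed 0.6 weighting are hypothetical choices for illustration, not values from the paper.

```python
import torch

def early_fusion(audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Feature-level (early) fusion: join per-modality feature vectors
    before any shared model sees them (here, simple concatenation)."""
    return torch.cat([audio_feats, text_feats], dim=-1)

def late_fusion(audio_probs: torch.Tensor, text_probs: torch.Tensor,
                w_audio: float = 0.6) -> torch.Tensor:
    """Decision-level (late) fusion: each modality is classified separately
    and the predictions are combined, here by a weighted average
    (the 0.6 audio weight is an arbitrary illustrative choice)."""
    return w_audio * audio_probs + (1.0 - w_audio) * text_probs

# Example: fuse an 8-track batch of audio and lyric representations.
fused = early_fusion(torch.randn(8, 128), torch.randn(8, 768))   # -> (8, 896)
probs = late_fusion(torch.softmax(torch.randn(8, 4), dim=-1),
                    torch.softmax(torch.randn(8, 4), dim=-1))    # -> (8, 4)
```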
Rashini Liyanarachchi
University of New South Wales, Australia

Aditya Joshi
University of New South Wales, Australia

Erik Meijering (Professor of Biomedical Image Computing)
University of New South Wales, Australia