Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition

📅 2025-12-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing pretrained models for symbolic music emotion recognition (SMER), such as MIDIBERT, neglect modal structure, despite evidence from music psychology that musical mode strongly influences emotional perception. Method: This work first analyzes MIDIBERT's deficiencies in representing mode-emotion associations, then proposes a music-psychology-inspired guidance framework, Mode-guided Feature-wise linear modulation injection (MoFi), which injects explicit, interpretable mode features into the layer identified as least emotion-relevant, using feature-wise linear modulation. Contribution/Results: Evaluated on EMOPIA and VGMIDI, MoFi achieves 75.2% and 59.1% accuracy, respectively, which the authors report as a significant improvement in SMER performance. These results support the effectiveness of mode guidance for SMER.

📝 Abstract

Symbolic music emotion recognition (SMER) is a key task in symbolic music understanding. Recent approaches have shown promising results by fine-tuning large-scale pre-trained models (e.g., MIDIBERT, a benchmark in symbolic music understanding) to map musical semantics to emotional labels. While these models effectively capture distributional musical semantics, they often overlook tonal structures, particularly musical modes, which play a critical role in emotional perception according to music psychology. In this paper, we investigate the representational capacity of MIDIBERT and identify its limitations in capturing mode-emotion associations. To address this issue, we propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model. Specifically, we first conduct a mode augmentation analysis, which reveals that MIDIBERT fails to effectively encode emotion-mode correlations. We then identify the least emotion-relevant layer within MIDIBERT and introduce a Mode-guided Feature-wise linear modulation injection (MoFi) framework that injects explicit mode features at that layer, thereby enhancing the model's capability in emotional representation and inference. Extensive experiments on the EMOPIA and VGMIDI datasets demonstrate that our mode injection strategy significantly improves SMER performance, achieving accuracies of 75.2% and 59.1%, respectively. These results validate the effectiveness of mode-guided modeling in symbolic music emotion recognition.
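The feature-wise linear modulation that MoFi builds on can be illustrated with a minimal sketch. In the generic FiLM formulation, a conditioning vector m produces a per-feature scale and shift, and the hidden state h becomes gamma(m) * h + beta(m). The code below is an assumption-laden illustration in plain Python, not the authors' implementation: the function names (`affine`, `film`), the one-hot mode encoding, and the weight shapes are all hypothetical, and the abstract does not specify how the mode features or the modulation parameters are actually computed.

```python
def affine(x, weights, bias):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(hidden, mode_vec, w_gamma, b_gamma, w_beta, b_beta):
    """Apply generic FiLM to a sequence of token vectors.

    hidden   : list of token vectors, each of length d (one layer's output)
    mode_vec : conditioning vector (assumed here: one-hot [major, minor])
    Returns gamma(mode) * h + beta(mode), applied feature-wise to every token.
    """
    gamma = affine(mode_vec, w_gamma, b_gamma)  # per-feature scale
    beta = affine(mode_vec, w_beta, b_beta)     # per-feature shift
    return [[g * h + b for h, g, b in zip(token, gamma, beta)]
            for token in hidden]


# Toy example: two tokens with two features each, conditioned on "major".
hidden = [[1.0, 2.0], [3.0, 4.0]]
major = [1.0, 0.0]

# Hypothetical parameters: gamma = [2, 1] and beta = [0, 1] for "major".
modulated = film(
    hidden, major,
    w_gamma=[[1.0, 0.0], [0.0, 0.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.0, 0.0],
)

# With zero weights, unit gamma bias, and zero beta bias, FiLM is the identity.
identity = film(
    hidden, major,
    w_gamma=[[0.0, 0.0], [0.0, 0.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.0, 0.0],
)
```

Initializing the modulation as the identity, as in the second call, is a common design choice when adding conditioning to a pre-trained network, because the injected layer then starts out leaving the pre-trained representations unchanged. Whether MoFi does this is not stated in the abstract.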

Problem

Research questions and friction points this paper is trying to address.

Enhances symbolic music emotion recognition by injecting tonal mode features

Addresses MIDIBERT's limitation in capturing mode-emotion psychological associations

Improves model accuracy via mode-guided feature modulation in emotion inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mode-guided feature injection into MIDIBERT

Augmenting model with psychological mode-emotion insights

Enhancing emotional representation via mode feature modulation
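To make the notion of an explicit mode feature concrete, the sketch below estimates major versus minor from a list of MIDI pitches by correlating a pitch-class histogram with the Krumhansl-Kessler key profiles. This is a swapped-in, classic key-finding technique used purely for illustration: the summary does not say how the paper extracts mode, and the function names (`pearson`, `estimate_mode`) are hypothetical.

```python
import math

# Krumhansl-Kessler key profiles, indexed by semitones above the tonic.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def estimate_mode(pitches):
    """Return (tonic_pitch_class, mode) for a list of MIDI note numbers.

    Assumption: note counts stand in for note durations, which a fuller
    implementation would normally use as histogram weights.
    """
    hist = [0.0] * 12
    for p in pitches:
        hist[p % 12] += 1.0

    best_r, best_tonic, best_mode = -2.0, 0, "major"
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its tonic aligns with `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            r = pearson(hist, rotated)
            if r > best_r:
                best_r, best_tonic, best_mode = r, tonic, mode
    return best_tonic, best_mode


# Toy inputs: triad-heavy note lists in C major and A minor.
c_major_notes = [60] * 4 + [64] * 2 + [67] * 3 + [62, 65, 69, 71]
a_minor_notes = [69] * 4 + [72] * 2 + [76] * 3 + [71, 74, 77, 79]

c_result = estimate_mode(c_major_notes)
a_result = estimate_mode(a_minor_notes)
```

The resulting mode label could then be encoded, for instance as a one-hot vector, and used as the conditioning input to a modulation layer. That encoding is likewise an assumption rather than something the summary specifies.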

🔎 Similar Papers

Are we there yet? A brief survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges