Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

📅 2024-09-26
🏛️ IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study investigates acoustic similarity between emotional speech and music in self-supervised learning (SSL) representation spaces and its implications for cross-domain emotion recognition. We first identify a significant emotion bias across layers of wav2vec 2.0 and BEATs models via comparative analysis of emotion representations. To address this, we propose a LoRA-based two-stage parameter-efficient fine-tuning paradigm for bidirectional knowledge transfer between the speech and music domains. Additionally, we use Fréchet Audio Distance to quantify emotion-level acoustic alignment. Experiments demonstrate that SSL models for speech and music share low-level emotion-correlated acoustic cues. On RAVDESS and GTZAN, cross-domain adaptation improves Speech Emotion Recognition (SER) and Music Emotion Recognition (MER) accuracy by 3.2% and 2.7%, respectively, validating the cross-modal transferability of emotion representations.
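As a rough illustration of the layer-wise analysis mentioned above, the sketch below extracts one mean-pooled embedding per wav2vec 2.0 layer and fits a linear probe per layer; the model checkpoint, pooling, and probe classifier are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: layer-wise probing of wav2vec 2.0 representations for emotion recognition,
# assuming 16 kHz mono clips and integer emotion labels. Dataset loading is omitted.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

@torch.no_grad()
def layerwise_embeddings(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return one mean-pooled embedding per layer (CNN output plus each transformer layer)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    out = model(inputs.input_values, output_hidden_states=True)
    # out.hidden_states: tuple of (1, time, dim) tensors, one per layer
    return np.stack([h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states])

def probe_each_layer(train_clips, train_labels, test_clips, test_labels):
    """Fit a linear probe per layer; per-layer accuracy exposes where emotion cues concentrate."""
    train_feats = np.stack([layerwise_embeddings(w) for w in train_clips])  # (N, layers, dim)
    test_feats = np.stack([layerwise_embeddings(w) for w in test_clips])
    for layer in range(train_feats.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(train_feats[:, layer], train_labels)
        acc = accuracy_score(test_labels, clf.predict(test_feats[:, layer]))
        print(f"layer {layer:2d}: accuracy {acc:.3f}")
```

The same probing loop can be repeated with a music SSL encoder (e.g. BEATs) to compare how emotion information distributes across layers in the two domains.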

📝 Abstract
Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotional speech and music, starting with an analysis of the layer-wise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Fréchet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary with different emotions due to their training strategies and domain specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential of cross-domain generalization to improve SER and MER systems.
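The AI summary above characterizes the two-stage parameter-efficient adaptation as LoRA-based; the sketch below follows that assumption using the Hugging Face `peft` library. The target modules, rank, label count, and training calls are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch of two-stage parameter-efficient adaptation with LoRA adapters.
# Stage 1 fine-tunes on the source domain (e.g. music emotion data); stage 2
# continues on the target domain (e.g. speech), reusing the same adapters.
from transformers import Wav2Vec2ForSequenceClassification
from peft import LoraConfig, get_peft_model

base = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=4,  # number of shared emotion classes (assumed)
)

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],          # attention projections inside wav2vec 2.0
    modules_to_save=["projector", "classifier"],  # keep the task head trainable as well
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapters + task head are updated

# Stage 1: adapt on the source domain (hypothetical training loop or Trainer call).
# train(model, source_domain_loader)
# Stage 2: continue fine-tuning the same adapters on the target domain.
# train(model, target_domain_loader)
```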
Problem

Research questions and friction points this paper is trying to address.

Exploring shared acoustic cues between emotional speech and music using SSL models
Investigating cross-domain adaptation for Speech and Music Emotion Recognition
Analyzing emotion bias in SSL models for speech and music
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing SSL model layers for SER and MER
Cross-domain adaptation via two-stage fine-tuning
Measuring acoustic similarity using Fréchet audio distance (a minimal sketch follows after this list)
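A per-emotion Fréchet Audio Distance can be computed by fitting Gaussians to speech and music embedding sets for each emotion label; the sketch below is a generic implementation under that assumption (embeddings taken from any SSL layer, e.g. via the extraction sketch above), not the paper's exact pipeline.

```python
# Sketch: per-emotion Fréchet distance between speech and music SSL embeddings,
# where each input is a dict mapping emotion label -> (N, D) embedding array.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to embedding sets x and y of shape (N, D)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

def per_emotion_fad(speech_embs: dict, music_embs: dict) -> dict:
    """Compute the distance separately for each emotion shared by both domains."""
    return {
        emo: frechet_distance(speech_embs[emo], music_embs[emo])
        for emo in speech_embs.keys() & music_embs.keys()
    }
```

Lower distances for a given emotion indicate closer acoustic alignment between the speech and music embedding distributions, which is how emotion-level bias can be compared across layers and models.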