Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited local interpretability of black-box models in speech emotion recognition (SER), in particular the difficulty of identifying discriminative frequency subbands under distribution shift. To this end, the authors propose EmoLIME, the first LIME-based method designed specifically for SER. EmoLIME operates on high-dimensional speech representations (e.g., Wav2Vec 2.0 embeddings) to identify the frequency subbands most influential for emotion classification and to generate high-level, interpretable attributions. Experiments across three benchmark datasets show that EmoLIME is more robust across models than across datasets with distribution shift, and that it delivers consistent explanations for both handcrafted-feature-based and pretrained-embedding-based classifiers. By connecting model-agnostic interpretability with speech-specific signal structure, EmoLIME improves the transparency and trustworthiness of SER models.

📝 Abstract
We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and Wav2Vec 2.0 embeddings. We find that EmoLIME exhibits stronger robustness across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.
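The paper itself provides no code here, but the LIME mechanism the abstract describes — perturbing interpretable components of the input and fitting a weighted linear surrogate to the black-box outputs — can be sketched for frequency subbands. Everything below (the function name, the subband-masking scheme, the proximity kernel, and the toy classifier in the usage note) is an illustrative assumption about how such an explainer could look, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

def subband_lime_explain(spec, predict_fn, n_bands=8, n_samples=500,
                         kernel_width=0.25, seed=0):
    """LIME-style attribution over frequency subbands of a spectrogram.

    spec: (freq_bins, frames) magnitude spectrogram
    predict_fn: maps a batch of spectrograms to class probabilities
    Returns one importance weight per frequency subband.
    """
    rng = np.random.default_rng(seed)
    # Split the frequency axis into n_bands contiguous subbands.
    edges = np.linspace(0, spec.shape[0], n_bands + 1, dtype=int)

    # Binary interpretable representation: which subbands are kept.
    z = rng.integers(0, 2, size=(n_samples, n_bands))
    z[0] = 1  # include the unperturbed instance

    # Build perturbed spectrograms by silencing the removed subbands.
    perturbed = np.empty((n_samples,) + spec.shape)
    for i, keep in enumerate(z):
        s = spec.copy()
        for b in range(n_bands):
            if not keep[b]:
                s[edges[b]:edges[b + 1], :] = 0.0
        perturbed[i] = s

    probs = predict_fn(perturbed)       # (n_samples, n_classes)
    target = probs[0].argmax()          # explain the top predicted emotion

    # Proximity kernel: samples close to the original count more.
    dist = 1.0 - z.mean(axis=1)         # fraction of removed subbands
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted linear surrogate; coefficients are subband importances.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(z, probs[:, target], sample_weight=weights)
    return surrogate.coef_
```

With a toy black-box whose prediction depends only on energy in the lowest subband, the surrogate assigns that subband the largest weight, which mirrors how the paper localizes influential frequency ranges for an emotion class.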
Problem

Research questions and friction points this paper is trying to address.

Interpret black-box Speech Emotion Recognition models
Identify influential frequency ranges for emotions
Improve explanation robustness across different datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies LIME to Speech Emotion Recognition
Identifies influential frequency ranges for emotions
Evaluates robustness across models and datasets
Maja J. Hjuler
University Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; School of Computer Science, Queensland University of Technology, Brisbane QLD 4000, Australia
Line H. Clemmensen
University of Copenhagen
Machine learning, multivariate statistics, statistical modelling, sparse modelling
Sneha Das
Dept. of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Lyngby, Denmark