Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited local interpretability of black-box models in speech emotion recognition (SER), in particular the difficulty of identifying discriminative frequency subbands under distribution shift. To this end, the authors propose EmoLIME, the first LIME-based method designed specifically for SER. EmoLIME operates on high-dimensional speech representations (e.g., Wav2Vec 2.0 embeddings) to identify the frequency subbands most influential for emotion classification and to generate high-level, interpretable attributions. Experiments across three benchmark datasets show that EmoLIME is more robust across models than across datasets with distribution shift, and that it delivers consistent explanations for both handcrafted-feature-based and pretrained-embedding-based classifiers. By connecting model-agnostic interpretability with speech-specific signal structure, EmoLIME improves the transparency and trustworthiness of SER models.

📝 Abstract
We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and Wav2Vec 2.0 embeddings. We find that EmoLIME exhibits stronger robustness across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.
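The paper itself provides no code here, but the LIME mechanism the abstract describes — perturbing interpretable components of the input and fitting a weighted linear surrogate to the black-box outputs — can be sketched for frequency subbands. Everything below (the function name, the subband-masking scheme, the proximity kernel, and the toy classifier in the usage note) is an illustrative assumption about how such an explainer could look, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

def subband_lime_explain(spec, predict_fn, n_bands=8, n_samples=500,
                         kernel_width=0.25, seed=0):
    """LIME-style attribution over frequency subbands of a spectrogram.

    spec: (freq_bins, frames) magnitude spectrogram
    predict_fn: maps a batch of spectrograms to class probabilities
    Returns one importance weight per frequency subband.
    """
    rng = np.random.default_rng(seed)
    # Split the frequency axis into n_bands contiguous subbands.
    edges = np.linspace(0, spec.shape[0], n_bands + 1, dtype=int)

    # Binary interpretable representation: which subbands are kept.
    z = rng.integers(0, 2, size=(n_samples, n_bands))
    z[0] = 1  # include the unperturbed instance

    # Build perturbed spectrograms by silencing the removed subbands.
    perturbed = np.empty((n_samples,) + spec.shape)
    for i, keep in enumerate(z):
        s = spec.copy()
        for b in range(n_bands):
            if not keep[b]:
                s[edges[b]:edges[b + 1], :] = 0.0
        perturbed[i] = s

    probs = predict_fn(perturbed)       # (n_samples, n_classes)
    target = probs[0].argmax()          # explain the top predicted emotion

    # Proximity kernel: samples close to the original count more.
    dist = 1.0 - z.mean(axis=1)         # fraction of removed subbands
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted linear surrogate; coefficients are subband importances.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(z, probs[:, target], sample_weight=weights)
    return surrogate.coef_
```

With a toy black-box whose prediction depends only on energy in the lowest subband, the surrogate assigns that subband the largest weight, which mirrors how the paper localizes influential frequency ranges for an emotion class.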
Problem

Research questions and friction points this paper is trying to address.

Interpret black-box Speech Emotion Recognition models
Identify influential frequency ranges for emotions
Improve explanation robustness across different datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies LIME to Speech Emotion Recognition
Identifies influential frequency ranges for emotions
Evaluates robustness across models and datasets
Maja J. Hjuler
University Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; School of Computer Science, Queensland University of Technology, Brisbane QLD 4000, Australia
Line H. Clemmensen
University of Copenhagen
Machine learning, multivariate statistics, statistical modelling, sparse modelling
Sneha Das
Dept. of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Lyngby, Denmark