🤖 AI Summary
This work addresses the limited local interpretability of black-box speech emotion recognition (SER) models, in particular the difficulty of identifying which frequency subbands drive emotion predictions. The authors propose EmoLIME, the first LIME-based explanation method designed specifically for SER. EmoLIME produces high-level explanations that localize the frequency subbands most influential for an emotion classification, and it applies both to classifiers built on handcrafted acoustic features and to those built on high-dimensional speech embeddings such as Wav2Vec 2.0. Experiments on three benchmark datasets show that EmoLIME is more robust across models than across datasets with distribution shifts, suggesting that it yields consistent explanations within a dataset. By pairing model-agnostic interpretability with speech-specific frequency structure, EmoLIME is a step toward more transparent and trustworthy SER models.
📝 Abstract
We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and Wav2Vec 2.0 embeddings. We find that EmoLIME exhibits stronger robustness across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.
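To make the idea concrete, here is a minimal sketch of LIME-style frequency-subband attribution for a black-box SER classifier. It is not the authors' implementation: the function names, band splitting, perturbation scheme, and toy classifier are all illustrative assumptions. The core LIME recipe is visible, though: randomly switch frequency bands on or off, query the black box on each perturbed signal, and fit a weighted linear surrogate whose coefficients rank the bands by influence.

```python
# Hypothetical LIME-style frequency-band attribution sketch (not the paper's code).
import numpy as np
from sklearn.linear_model import Ridge

def explain_frequency_bands(signal, predict_fn, n_bands=8, n_samples=200, seed=0):
    """Attribute a black-box score to frequency subbands, LIME-style.

    predict_fn maps a 1-D waveform to a scalar score for the target emotion.
    Returns one weight per subband; a larger weight means a more influential band.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(signal)                             # frequency-domain view
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)

    masks = rng.integers(0, 2, size=(n_samples, n_bands))  # random band on/off patterns
    masks[0] = 1                                           # include the unperturbed signal
    scores = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = spec.copy()
        for b in range(n_bands):
            if mask[b] == 0:                               # silence a deactivated band
                perturbed[edges[b]:edges[b + 1]] = 0
        scores[i] = predict_fn(np.fft.irfft(perturbed, n=len(signal)))

    # Local linear surrogate, weighting samples by similarity to the original.
    similarity = np.exp(-(n_bands - masks.sum(axis=1)) / n_bands)
    surrogate = Ridge(alpha=1.0).fit(masks, scores, sample_weight=similarity)
    return surrogate.coef_                                 # per-band attributions

# Toy usage: a stand-in "classifier" sensitive only to low-frequency energy,
# so the lowest band should receive the largest attribution.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
predict = lambda x: float(np.abs(np.fft.rfft(x)[:1000]).sum()) / 1e4
weights = explain_frequency_bands(signal, predict)
```

A real SER setup would replace `predict_fn` with the probability output of a classifier over hand-crafted features or Wav2Vec 2.0 embeddings; the perturbation and surrogate-fitting loop stays the same.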