Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the insufficient fine-grained attribution capability in explainable speech emotion recognition (SER). Methodologically, we propose an LLM-enhanced end-to-end framework: (1) leveraging HuBERT representations, we jointly model speech emotion descriptors (SEDs) via alternating multi-task fine-tuning (SER/SED/ASR), explicitly disentangling fine-grained acoustic attributes such as pitch and intonation; (2) we integrate an information bottleneck (IB)-guided variational autoencoder (VAE) to explicitly control the trade-off between feature granularity and interpretability. Our key contributions are the first unified optimization of LLM-driven SED modeling with SER, and the co-design of multi-task disentanglement and IB-VAE–based granularity regulation. On IEMOCAP and MELD, our method achieves absolute improvements of 4.0% and 3.7% in unweighted accuracy (relative gains of +5.4% and +6.6%), significantly outperforming single-mechanism baselines while providing verifiable, attribute-level emotional attributions.
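The summary above describes an information bottleneck (IB)-guided VAE whose objective trades reconstruction fidelity against latent compression, with the weight on the KL term controlling feature granularity. A minimal NumPy sketch of such a β-weighted objective (the function names and the scalar `beta` are illustrative assumptions, not the paper's published implementation):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def ib_vae_loss(x, x_recon, mu, logvar, beta):
    """IB-style VAE objective: reconstruction error plus beta-weighted KL.

    Larger beta compresses the latent code harder (coarser, more
    interpretable features); smaller beta preserves fine granularity.
    """
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=-1))  # per-sample MSE
    kl = np.mean(kl_diag_gaussian(mu, logvar))            # mean KL penalty
    return recon + beta * kl
```

In this reading, sweeping `beta` is what "explicitly controls the trade-off between feature granularity and interpretability" in the summary.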

📝 Abstract
This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning on joint SER-SED prediction and ASR tasks. VAE-compressed HuBERT features derived via the Information Bottleneck (IB) are used to adjust feature granularity. Experiments on the IEMOCAP and MELD benchmarks demonstrate that our approach consistently outperforms comparable LLaMA-based SER baselines, including those using either (a) alternating multi-task fine-tuning alone or (b) feature disentanglement only. Statistically significant increases in SER unweighted accuracy of up to 4.0% and 3.7% absolute (5.4% and 6.6% relative) are obtained. More importantly, the emotion descriptors offer further explainability for SER.
Problem

Research questions and friction points this paper is trying to address.

Develops explainable speech emotion recognition using LLM-enhanced descriptors
Disentangles fine-grained speech features via multi-task LLM fine-tuning
Improves SER accuracy with interpretable emotion descriptor features
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-empowered fine-grained speech descriptors
Alternating LLM fine-tuning for SER-SED prediction
VAE-compressed HuBERT features via an Information Bottleneck
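The alternating fine-tuning listed above cycles updates among the SER-SED and ASR objectives rather than mixing them into a single loss. A round-robin task scheduler is one plausible way to realize this (a sketch under that assumption; the exact schedule is not given in this summary):

```python
from itertools import cycle

def alternating_schedule(tasks, n_steps):
    """Round-robin task order: each fine-tuning step updates one task objective."""
    order = cycle(tasks)
    return [next(order) for _ in range(n_steps)]

# e.g. six steps cycling over the tasks named in the summary
steps = alternating_schedule(["SER", "SED", "ASR"], 6)
```

Under this schedule every task is revisited once per cycle, which is one way the LLM can share HuBERT-derived representations across SER, SED, and ASR without any single objective dominating.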