Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of speech emotion recognition, which is hindered by the scarcity of large-scale labeled datasets and the difficulty of efficiently extracting emotion-relevant features. The authors propose a lightweight approach built on the pre-trained Whisper encoder, presenting the first systematic evaluation of its intermediate layers for emotion-recognition suitability. They further introduce two novel attention-based pooling mechanisms, multi-head attentive average pooling and QKV pooling, that preserve critical emotional information while reducing dimensionality. Experimental results on the English IEMOCAP and Persian ShEMO datasets show that the proposed method achieves state-of-the-art performance on ShEMO, with a 2.47% absolute improvement in unweighted accuracy. Notably, the Small Whisper variant matches the performance of HuBERT X-Large, underscoring the model's cross-lingual effectiveness and computational efficiency.

📝 Abstract
Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
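The abstract does not spell out how QKV pooling reduces a variable-length sequence of Whisper frame representations to a fixed-size utterance embedding. The sketch below is one plausible NumPy reading of multi-head QKV pooling, not the authors' implementation: the query construction (mean of projected frames), the projection shapes, and all names (`qkv_pool`, `w_q`, `w_k`, `w_v`) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qkv_pool(frames, w_q, w_k, w_v, n_heads):
    """Pool (T, d) frame-level features into one d-dim utterance vector.

    frames: (T, d) encoder outputs; w_q/w_k/w_v: (d, d) learned projections.
    An utterance-level query (here: mean of the projected frames, an
    assumption) attends over keys; the weighted sum of values per head,
    concatenated across heads, is the pooled embedding.
    """
    T, d = frames.shape
    dh = d // n_heads
    q = (frames @ w_q).mean(axis=0)                         # (d,)
    k = frames @ w_k                                        # (T, d)
    v = frames @ w_v                                        # (T, d)
    # Split into heads: queries (n_heads, dh), keys/values (n_heads, T, dh).
    qh = q.reshape(n_heads, dh)
    kh = k.reshape(T, n_heads, dh).transpose(1, 0, 2)
    vh = v.reshape(T, n_heads, dh).transpose(1, 0, 2)
    scores = np.einsum('hd,htd->ht', qh, kh) / np.sqrt(dh)  # (n_heads, T)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    pooled = np.einsum('ht,htd->hd', weights, vh)           # (n_heads, dh)
    return pooled.reshape(d)

rng = np.random.default_rng(0)
T, d = 50, 384   # 384 matches Whisper Tiny's hidden size
frames = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
pooled = qkv_pool(frames, w_q, w_k, w_v, n_heads=4)
print(pooled.shape)  # (384,)
```

In a trained system the projections would be learned end-to-end with the emotion classifier; the fixed-size `pooled` vector is what makes a lightweight classification head possible regardless of utterance length.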
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
limited datasets
pre-trained models
representation extraction
dimensionality reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Whisper
Speech Emotion Recognition
Attentive Pooling
QKV Pooling
Pre-trained Representations
Ali Shendabadi
Faculty of Intelligent Systems Engineering, University of Tehran, N.Kargar, Tehran, Iran.
Parnia Izadirad
Faculty of Intelligent Systems Engineering, University of Tehran, N.Kargar, Tehran, Iran.
Mostafa Salehi
Associate Professor, University of Tehran
Social Network and Media Analysis, Network Science
Mahmoud Bijankhan
Faculty of Literature and Humanities, University of Tehran, Enghelab, Tehran, Iran.