ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition

📅 2025-09-01
🤖 AI Summary
Arabic speech emotion recognition (SER) has long been held back by scarce data and by the high computational cost of existing models. To address both problems, this paper proposes a lightweight hybrid architecture that takes Mel spectrograms as input and combines 2D convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM) networks, and self-attention mechanisms. It deliberately avoids conventional MFCC features and 1D convolutions in order to capture fine-grained spectro-temporal emotional cues. The resulting model has only about 1 million parameters, roughly 1/90 the size of HuBERT base and 1/74 the size of Whisper, which makes it practical to deploy on resource-constrained devices. The authors report state-of-the-art (SOTA) accuracy on Arabic SER benchmarks. The work thus offers an accurate yet computationally efficient framework for SER in low-resource languages.

📝 Abstract
Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Addressing Arabic speech emotion recognition with limited data and resources
Overcoming the loss of nuanced emotional cues in traditional feature extraction
Providing an efficient model for resource-constrained environments, in contrast to large architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid 2D CNN-BiLSTM with attention mechanism
Uses Mel spectrograms instead of MFCC features
Lightweight model with only 1 million parameters
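
To make the "about 1 million parameters" figure concrete, the sketch below counts the learnable parameters of a small 2D CNN, BiLSTM, and attention stack. Every layer size here (64 Mel bins, three convolution blocks of 32/64/128 channels, 2x2 pooling, hidden size 128, 4 emotion classes) is a hypothetical choice for illustration and is not taken from the paper. A single-layer attention-pooling scorer stands in for the attention mechanism, and the LSTM count follows the common convention of two bias vectors per gate.

```python
# Hypothetical layer sizes for illustration only; NOT the paper's configuration.

def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def batchnorm_params(channels):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * channels

def bilstm_params(input_size, hidden_size):
    """One bidirectional LSTM layer: 4 gates per direction,
    each with input weights, recurrent weights, and two bias vectors."""
    per_direction = 4 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)
    return 2 * per_direction

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def count_params(n_mels=64, channels=(32, 64, 128), hidden=128, n_classes=4):
    total = 0
    in_ch, freq = 1, n_mels            # one-channel Mel-spectrogram input
    for out_ch in channels:
        total += conv2d_params(in_ch, out_ch) + batchnorm_params(out_ch)
        in_ch = out_ch
        freq //= 2                     # assumed 2x2 max pooling after each block
    lstm_in = in_ch * freq             # channels x remaining Mel bins per time step
    total += bilstm_params(lstm_in, hidden)
    total += linear_params(2 * hidden, 1)          # attention-pooling scorer (assumed)
    total += linear_params(2 * hidden, n_classes)  # emotion classifier
    return total

if __name__ == "__main__":
    print(f"total parameters: {count_params():,}")
```

With these assumed sizes the total is roughly 1.28 million parameters, and almost all of them sit in the BiLSTM input weights. This suggests that how far the convolution blocks shrink the frequency axis before the recurrent layer is the main lever for keeping such a model small.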
Authors

Ali Abouzeid
MSc @MBZUAI, B.Eng @UTM JB
Robot Learning, 3D Vision
Bilal Elbouardi
Mohamed bin Zayed University of Artificial Intelligence, University of Waterloo
Mohamed Maged
Mohamed bin Zayed University of Artificial Intelligence, University of Waterloo
Shady Shehata
University of Waterloo
Artificial Intelligence, Natural Language Processing