ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge

📅 2025-05-23
🤖 AI Summary
Speech emotion recognition (SER) in naturalistic settings is difficult because of intrinsic speaker and contextual variability, diverse recording conditions, and severe class imbalance. To address these issues, the paper presents Abhinaya, a system that combines speech-based, text-based, and joint speech-text models. It fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, uses large language models (LLMs) for textual context, and uses an SLLM for speech-text modeling to capture nuanced emotional cues. To mitigate class imbalance, it applies tailored loss functions and produces categorical decisions by majority voting. In the Interspeech Naturalistic SER Challenge, the system ranked 4th among 166 submissions even though one model was not fully trained. Once training was completed, it achieved state-of-the-art performance among published results, which the authors present as evidence that the approach is effective for SER in real-world conditions.

📝 Abstract
Speech emotion recognition (SER) in naturalistic settings remains a challenge due to intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge, which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, leverages large language models (LLMs) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
Problem

Research questions and friction points this paper is trying to address.

Speech emotion recognition in naturalistic settings is challenging due to variability and class imbalance.
The paper presents Abhinaya, integrating speech, text, and speech-text models for SER.
The system addresses class imbalance with tailored loss functions and majority voting.
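
As a rough illustration of what a loss tailored to class imbalance can look like, the sketch below implements a class-weighted cross-entropy in plain Python. The inverse-frequency weighting, the function names, and the toy labels are all assumptions made for illustration; this page does not say which loss functions Abhinaya actually uses.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes weigh more.

    Assumption: inverse-frequency weighting is one common choice,
    not necessarily the scheme used in the paper.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * counts[cls]) for cls in counts}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs  : list of dicts mapping class name -> predicted probability
    labels : list of true class names, same length as probs
    weights: dict mapping class name -> weight
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical toy data: three "neutral" utterances and one "fear" utterance.
toy_labels = ["neutral", "neutral", "neutral", "fear"]
toy_weights = inverse_frequency_weights(toy_labels)
toy_probs = [{"neutral": 0.5, "fear": 0.5} for _ in toy_labels]
toy_loss = weighted_cross_entropy(toy_probs, toy_labels, toy_weights)
```

With these weights, an error on the single "fear" utterance contributes three times as much to the loss as an error on any one "neutral" utterance, which is the basic effect a class-weighted loss is meant to have.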
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes self-supervised and speech LLMs
Leverages LLMs for textual context
Employs speech-text modeling with SLLM
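
The majority-voting step that combines the speech, text, and speech-text models can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the model names, the example labels, and the tie-breaking rule (the earliest vote wins) are hypothetical, since this page does not describe how the system resolves ties or how many models vote.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label in predictions.

    Assumption: ties are broken in favor of the label that appears
    first in the list; the paper's tie rule is not stated here.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-model decisions for a single utterance.
votes = {"speech": "angry", "text": "neutral", "speech_text": "angry"}
decision = majority_vote(list(votes.values()))
```

Here two of the three hypothetical models agree, so the ensemble decision is "angry". Ordering the list by model reliability is one simple way to make the tie-breaking rule meaningful.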
Soumya Dutta
Assistant Professor of Computer Science at IIT Kanpur
Machine Learning, Visual Computing, xAI, Data Science, HPC
Smruthi Balaji
Shiv Nadar University, Chennai, India
R. Varada
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India
Viveka Salinamakki
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India
Sriram Ganapathy
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India