🤖 AI Summary
Speech emotion recognition (SER) in naturalistic settings is difficult because of speaker and contextual variability, diverse recording conditions, and severe class imbalance. To address these issues, this paper presents Abhinaya, a system that combines speech-based, text-based, and joint speech-text models. It fine-tunes self-supervised speech models and speech large language models (SLLMs) for speech representations, uses large language models (LLMs) for textual context, and applies speech-text modeling with an SLLM to capture fine-grained emotional cues. To mitigate class imbalance, it uses a class-weighted loss function and an ensemble-based majority voting strategy for the final categorical decision. In the Interspeech Naturalistic SER Challenge, the system ranked 4th among 166 submissions even though one model was not fully trained; once training was completed, it achieved state-of-the-art (SOTA) performance among published results. These results indicate that the approach is effective for SER in realistic, unconstrained conditions.
📝 Abstract
Speech emotion recognition (SER) in naturalistic settings remains a challenge due to intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge, which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, leverages large language models (LLMs) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
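The two imbalance-handling ideas mentioned above, a loss tailored to class frequencies and majority voting over model outputs, can be illustrated with a minimal sketch. This is not the Abhinaya implementation: the abstract does not specify the loss, so inverse-frequency weighting is an assumption, and the function names, emotion labels, and tie-breaking rule are all hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: N / (K * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the total target weight.

    probs is a list of dicts mapping class name to predicted probability.
    """
    total = sum(weights[t] * -math.log(p[t]) for p, t in zip(probs, targets))
    return total / sum(weights[t] for t in targets)


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical, imbalanced training labels.
train_labels = ["neutral"] * 6 + ["happy"] * 3 + ["fear"] * 1
class_weights = inverse_frequency_weights(train_labels)

# Hypothetical per-utterance decisions from three models
# (for example speech-only, text-only, and speech-text).
decision = majority_vote(["happy", "neutral", "happy"])
```

With these toy labels the rare class "fear" receives the largest weight, so mistakes on it contribute more to the loss than mistakes on "neutral". The vote returns a single categorical label per utterance; how ties are resolved in the actual system is not stated in the abstract.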