🤖 AI Summary
This study addresses gender bias in multilingual, multimodal speech LLMs for emotion recognition, and the lack of clarity about their fairness across languages and modalities. To tackle this, the authors construct a multilingual (English, Japanese, German) multimodal benchmark based on MELD-ST and propose ERM-MinMaxGAP, a novel approach that jointly optimizes emotion recognition performance and gender fairness. The method combines empirical risk minimization with a MinMaxGAP regularization term and an adaptive fairness weighting mechanism. Experiments show that the proposed approach improves emotion recognition accuracy by 5.5% and 5.0% in the unimodal and multimodal settings, respectively, while reducing the gender bias gap by 0.1% and 1.4%.
📝 Abstract
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with an adaptive fairness weighting mechanism and a novel MinMaxGAP regularizer penalizing the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
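To make the shape of such an objective concrete, here is a minimal sketch of an ERM term combined with a MinMaxGAP-style penalty on the largest male-female loss gap across languages. All names, the dictionary layout, and the adaptive-weight rule are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative sketch of an ERM + MinMaxGAP-style training objective.
# group_losses maps each language to per-gender average losses on a batch.
# The adaptive weighting rule below is an assumption for illustration only.

def minmax_gap(group_losses):
    """Largest absolute male-female loss gap across languages."""
    return max(abs(losses["male"] - losses["female"])
               for losses in group_losses.values())

def erm_minmaxgap_loss(group_losses, base_weight=1.0):
    """ERM term (mean of all group losses) plus a weighted gap penalty.

    The fairness weight scales with the current gap relative to the ERM
    loss, so larger disparities are penalized more strongly.
    """
    all_losses = [l for per_gender in group_losses.values()
                  for l in per_gender.values()]
    erm = sum(all_losses) / len(all_losses)
    gap = minmax_gap(group_losses)
    adaptive_weight = base_weight * gap / (erm + 1e-8)  # assumed rule
    return erm + adaptive_weight * gap

# Example: per-language, per-gender average losses on one batch
losses = {
    "en": {"male": 0.9, "female": 1.2},
    "ja": {"male": 1.5, "female": 1.0},
    "de": {"male": 1.1, "female": 1.1},
}
total = erm_minmaxgap_loss(losses)
```

In a real training loop these would be differentiable tensor losses rather than floats, and the per-group averages would be computed from gender and language labels in each batch.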