CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech emotion recognition (SER) suffers from group bias caused by spurious correlations between speaker characteristics and emotion labels. Existing debiasing methods often rely on sensitive-attribute annotations or model-specific modifications, limiting their generalizability. To address this, we propose a universal, plug-and-play debiasing framework that requires neither changes to the model architecture nor demographic annotations. First, bias-affected samples are identified in an unsupervised manner via confidence analysis. Second, voice conversion (VC) is employed to synthesize diverse speaker identities, thereby weakening spurious correlations. Third, adversarial data augmentation and bias-pattern mining are integrated to further mitigate bias. This work introduces the novel paradigm of "confidence-guided, voice-enhanced debiasing": a model-agnostic approach applicable to any SER system. Extensive experiments on multiple benchmark datasets demonstrate significant improvements in cross-group fairness (ΔEO ≤ 0.03) while preserving or even improving overall accuracy.

📝 Abstract
Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotion labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying the model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns in the training data and then applies voice conversion to alter emotion-irrelevant attributes and generate augmented samples. These augmented samples introduce speaker variations that differ from the dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with a wide range of SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
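The pipeline described above, flagging bias-affected training samples by confidence and then re-synthesizing them with different speaker identities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the confidence threshold, the `convert_fn` voice-conversion callable, and both function names are hypothetical stand-ins.

```python
import numpy as np

def select_bias_affected(probs, labels, threshold=0.5):
    """Flag samples whose predicted probability on the true emotion
    label falls below a threshold. This is a stand-in for the paper's
    confidence-based identification step (the threshold is hypothetical).

    probs:  (N, C) array of per-class probabilities from the SER model
    labels: (N,) array of true emotion-label indices
    Returns the indices of low-confidence (bias-affected) samples."""
    conf = probs[np.arange(len(labels)), labels]
    return np.where(conf < threshold)[0]

def augment_with_vc(waveforms, flagged_idx, convert_fn):
    """For each flagged sample, synthesize a variant with a different
    speaker identity via a generic voice-conversion callable
    `convert_fn(waveform) -> waveform` (placeholder API). Labels are
    unchanged, since VC alters speaker traits, not emotional content."""
    return [convert_fn(waveforms[i]) for i in flagged_idx]
```

The augmented samples would then be added back to the training set with their original emotion labels, so the model sees the same emotion spoken by varied speaker identities.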
Problem

Research questions and friction points this paper is trying to address.

Mitigates bias in speech emotion recognition systems
Avoids model changes and demographic annotations
Enhances fairness using voice conversion augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses voice conversion to alter irrelevant attributes
Generates samples differing from dominant data patterns
Compatible with various SER models and tools