AI Summary
This study addresses systematic gender bias in current emotion classification models, which exhibit significantly lower accuracy in recognizing emotions expressed by males. Leveraging a dataset of over one million self-annotated texts and employing a pre-registered experimental design, the research systematically evaluates gender disparities across 414 model–emotion category combinations. It presents the first large-scale evidence from self-reported data demonstrating that emotion recognition models consistently underestimate male-expressed emotions and quantifies the potential downstream impact of this bias. Results reveal that error rates for male-authored texts are significantly higher than those for female-authored texts across all model types and emotion categories, underscoring a critical gap in demographic fairness within existing affective computing technologies.
Abstract
The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.