🤖 AI Summary
This paper investigates the prediction fairness of deep homogeneous ensemble models in multi-class settings with subgroup heterogeneity, characterized by differences in task difficulty and sample size across subgroups. We propose a fairness analysis framework based on subgroup-wise performance decomposition, integrating fairness metrics such as equal opportunity difference and accuracy gap, and conduct controlled experiments to isolate the impact of ensemble mechanisms on the fairness–accuracy trade-off. Our key findings are threefold: (1) homogeneous ensembling significantly mitigates worst-subgroup accuracy gaps without compromising overall accuracy; (2) task difficulty and data imbalance exhibit a nontrivial interaction effect, challenging the common “balance-is-better” assumption; and (3) excessive balancing can exacerbate fairness disparities. Across multiple benchmark datasets, the approach reduces subgroup accuracy gaps by 37% on average while maintaining or slightly improving overall accuracy.
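The two fairness metrics named above can be computed directly from per-subgroup predictions. The sketch below is illustrative and not from the paper: the function names are our own, and `equal_opportunity_difference` assumes the binary-label special case (max pairwise gap in true-positive rate across subgroups), whereas the paper works in a multi-class setting.

```python
import numpy as np

def subgroup_accuracy_gap(y_true, y_pred, groups):
    """Gap between the best- and worst-performing subgroup's accuracy."""
    accs = [np.mean(y_pred[groups == g] == y_true[groups == g])
            for g in np.unique(groups)]
    return max(accs) - min(accs)

def equal_opportunity_difference(y_true, y_pred, groups, positive=1):
    """Largest gap in true-positive rate across subgroups (binary labels)."""
    tprs = []
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == positive)  # positives in subgroup g
        tprs.append(np.mean(y_pred[mask] == positive))
    return max(tprs) - min(tprs)
```

A "subgroup-wise performance decomposition" as described in the summary would report these quantities per subgroup rather than only their extremes.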
📝 Abstract
Ensembling is commonly regarded as an effective way to improve the overall performance of machine learning models while also increasing the robustness of their predictions. In algorithmic fairness, heterogeneous ensembles, composed of multiple model types, have been employed to mitigate biases with respect to demographic attributes such as sex, age, or ethnicity. Moreover, recent work has shown that in multi-class problems even simple homogeneous ensembles can improve the performance of the worst-performing target classes. Although homogeneous ensembles are simpler to implement in practice, it is not yet clear whether their benefits extend to groups defined not by their target class but by demographic or protected attributes, thereby improving fairness. In this work we show that this simple and straightforward method does indeed mitigate disparities, particularly benefiting under-performing subgroups. Interestingly, this can be achieved without sacrificing overall performance, a trade-off commonly observed in bias-mitigation strategies. We also analyze the interplay between two factors that may produce biases: subgroup under-representation and the inherent difficulty of the task for each group. This analysis reveals that, contrary to popular assumptions, balanced datasets may be suboptimal when task difficulty varies between subgroups. Indeed, we find that a perfectly balanced dataset may both hurt overall performance and widen the gap between groups. This highlights the importance of considering the interaction between the multiple forces at play in fairness.
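The homogeneous-ensembling idea discussed above boils down to training K identically-architected members (e.g., with different random seeds) and aggregating their outputs. As a minimal sketch, assuming soft-voting aggregation over the members' class-probability outputs (the paper does not specify the aggregation rule here; the function name is our own):

```python
import numpy as np

def ensemble_predict(member_probs):
    """Soft-voting over a homogeneous ensemble.

    member_probs: array of shape (K, n_samples, n_classes), one slice of
    predicted class probabilities per identically-architected member.
    Returns the argmax of the averaged probabilities per sample.
    """
    return np.mean(member_probs, axis=0).argmax(axis=1)
```

Averaging the members' probabilities reduces the variance of individual predictions; the abstract's claim is that this variance reduction disproportionately benefits under-performing subgroups, narrowing the accuracy gap without lowering overall accuracy.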