🤖 AI Summary
Existing fairness tools are often limited to single demographic attributes and struggle to capture the compounded biases faced by intersecting groups, such as combinations of race and gender, in clinical machine learning. This work proposes Fairlogue, a Python toolkit that extends observational fairness metrics, including demographic parity and equalized odds, to intersectional subgroups, while integrating two counterfactual fairness frameworks to evaluate intervention-based equity. Applied to electronic health record data in a glaucoma surgery prediction task using logistic regression, the approach uncovers substantial intersectional unfairness, including a demographic parity gap of 0.20. Crucially, the disparities identified through intersectional analysis markedly exceed those detected by single-axis assessments, underscoring the necessity and value of this method for auditing fairness in clinical algorithms.
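The core observational idea, extending demographic parity from a single attribute to intersectional subgroups, can be sketched as below. This is not Fairlogue's actual API; the function name, signature, and toy data are illustrative assumptions, and the random data stands in for (and is unrelated to) the All of Us cohort. The gap is simply the spread of positive-prediction rates across all race × gender cells rather than across one attribute at a time.

```python
import numpy as np
import pandas as pd

def intersectional_demographic_parity_gap(df, pred_col, attrs):
    """Largest minus smallest positive-prediction rate across the
    intersectional subgroups defined by `attrs` (e.g. race x gender)."""
    rates = df.groupby(attrs)[pred_col].mean()
    return float(rates.max() - rates.min())

# Toy data (illustrative only; hypothetical attributes and predictions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "race": rng.choice(["A", "B"], size=400),
    "gender": rng.choice(["F", "M"], size=400),
})
# Bias the toy predictions so one intersectional subgroup is favored.
base = 0.3 + 0.2 * ((df["race"] == "A") & (df["gender"] == "F"))
df["y_hat"] = (rng.random(400) < base).astype(int)

gap = intersectional_demographic_parity_gap(df, "y_hat", ["race", "gender"])
print(f"intersectional DP gap: {gap:.3f}")
```

Because each single-axis rate is a weighted average of the intersectional subgroup rates it contains, the intersectional gap is always at least as large as any single-axis gap, which is consistent with the summary's observation that intersectional analysis surfaces disparities single-axis audits understate.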
📝 Abstract
Objective: Algorithmic fairness is essential for equitable and trustworthy machine learning in healthcare. Most fairness tools emphasize single-axis demographic comparisons and may miss compounded disparities affecting intersectional populations. This study introduces Fairlogue, a toolkit designed to operationalize intersectional fairness assessment in observational and counterfactual contexts within clinical settings.

Methods: Fairlogue is a Python-based toolkit composed of three components: 1) an observational framework extending demographic parity, equalized odds, and equal opportunity difference to intersectional populations; 2) a counterfactual framework evaluating fairness in treatment-based contexts; and 3) a generalized counterfactual framework assessing fairness under interventions on intersectional group membership. The toolkit was evaluated on electronic health record data from the All of Us Controlled Tier V8 dataset, in a glaucoma surgery prediction task using logistic regression with race and gender as protected attributes.

Results: Observational analysis identified substantial intersectional disparities despite moderate model performance (AUROC = 0.709; accuracy = 0.651). Intersectional evaluation revealed larger fairness gaps than single-axis analyses, including a demographic parity difference of 0.20 and equalized odds true positive and false positive rate gaps of 0.33 and 0.15, respectively. Counterfactual analysis using permutation-based null distributions produced unfairness ("u-value") estimates near zero, suggesting the observed disparities were consistent with chance after conditioning on covariates.

Conclusion: Fairlogue provides a modular toolkit integrating observational and counterfactual methods for quantifying and evaluating intersectional bias in clinical machine learning workflows.
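The abstract does not define the "u-value", so the sketch below shows one plausible construction of a permutation-based unfairness estimate: the observed disparity minus its expectation under a null distribution obtained by shuffling group labels, so that values near zero indicate the gap is consistent with chance. The function names and this exact definition are assumptions, and the paper's method additionally conditions on covariates, which this toy version omits.

```python
import numpy as np

def dp_gap(preds, groups):
    """Spread of positive-prediction rates across group labels."""
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def permutation_u_value(preds, groups, n_perm=500, seed=0):
    """Hypothetical u-value: observed gap minus its mean under a
    permutation null. Near zero => disparity consistent with chance."""
    rng = np.random.default_rng(seed)
    observed = dp_gap(preds, groups)
    null_mean = np.mean([dp_gap(preds, rng.permutation(groups))
                         for _ in range(n_perm)])
    return float(observed - null_mean)

# Strongly biased toy predictions: group A always flagged, B never.
preds = np.array([1] * 100 + [0] * 100)
groups = np.array(["A"] * 100 + ["B"] * 100)
u = permutation_u_value(preds, groups)
print(f"u-value: {u:.3f}")
```

Under this reading, the near-zero u-values reported in the Results would mean the observed intersectional gaps shrink to roughly what label-shuffling alone produces once covariates are accounted for, whereas the strongly biased toy data above yields a u-value well above zero.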