🤖 AI Summary
Large language models (LLMs) exhibit significant gender, age, and racial biases in ICU mortality prediction, and existing debiasing methods often degrade predictive accuracy. Method: We propose CAP (Case-aware Debiasing Prompting), a fine-tuning-free, clinically adaptable prompting framework that integrates bias-mitigating prompts with misprediction-case-guided reasoning, leveraging structured MIMIC-IV data to steer model calibration. Contribution/Results: Evaluated via multidimensional fairness metrics and attention consistency analysis, CAP achieves an AUROC of 0.873 (+6.7% absolute gain) and AUPRC of 0.694 (+19.7%), while reducing performance disparities across gender and racial groups by over 90%. Cross-group attention similarity exceeds 0.98, marking the first demonstration of simultaneous improvement in both fairness and accuracy for LLM-based clinical risk prediction.
📝 Abstract
Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.