Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction

📅 2025-12-17

📈 Citations: 0

✨ Influential: 0

career value

165K/year

🤖 AI Summary

Large language models (LLMs) exhibit significant gender, age, and racial biases in ICU mortality prediction, and existing debiasing methods often degrade predictive accuracy. Method: We propose CAP (Case-aware Debiasing Prompting), a fine-tuning-free, clinically adaptable prompting framework that integrates bias-mitigating prompts with misprediction-case-guided reasoning, leveraging structured MIMIC-IV data to steer model calibration. Contribution/Results: Evaluated via multidimensional fairness metrics and attention consistency analysis, CAP achieves an AUROC of 0.873 (+6.7% absolute gain) and AUPRC of 0.694 (+19.7%), while reducing performance disparities across gender and racial groups by over 90%. Cross-group attention similarity exceeds 0.98, marking the first demonstration of simultaneous improvement in both fairness and accuracy for LLM-based clinical risk prediction.

Technology Category

Application Category

📝 Abstract

Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.

Problem

Research questions and friction points this paper is trying to address.

Mitigate demographic bias in LLM-based ICU mortality predictions

Improve fairness and accuracy simultaneously without model retraining

Develop a prompting framework for equitable clinical decision support

Innovation

Methods, ideas, or system contributions that make the work stand out.

Case prompting framework integrates debiasing prompts with case-based reasoning

Training-free method improves fairness and accuracy without model retraining

Guides model to learn from similar historical misprediction cases

🔎 Similar Papers

No similar papers found.