🤖 AI Summary
This study examines how social identity markers, specifically sexual orientation and religious affiliation, degrade the accuracy and confidence calibration of large language models (LLMs) in medical question answering, threatening the fairness and safety of clinical deployment. The authors evaluate nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, in which the patient's identity descriptors are changed. They find that "homosexual" markers consistently trigger performance drops, and that intersecting sexual-orientation and religious identities produce idiosyncratic, non-additive harms to calibration. A clinician-validated case study shows that these failures persist in open-ended generation and are not an artifact of the multiple-choice format. Because identity cues affect the reliability of confidence signals rather than merely shifting predictions, the findings point to a significant risk for confidence-based clinical decision-making.
📝 Abstract
Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.
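To make the notion of a calibration shift concrete, the following is a minimal sketch of one common calibration metric, expected calibration error (ECE), computed separately for a baseline question set and its counterfactual variant. The abstract does not state which calibration metric the authors use, so ECE is an assumption here, and the function name, bin count, and sample values are hypothetical and purely illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the weighted mean gap between accuracy and confidence.

    Assumption: each confidence is the model's probability for its chosen answer,
    in the range [0, 1]. This is an illustrative metric, not the paper's protocol.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # total confidence per bin
    hit_sum = [0.0] * n_bins   # number of correct answers per bin
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0

    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to sum_b |hits_b - conf_sum_b| / n
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n


# Made-up toy values: the same four questions, without and with an identity marker.
baseline_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
variant_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibration_shift = variant_ece - baseline_ece
```

In this toy case the stated confidence is identical across the two conditions while accuracy falls, so the variant's ECE is higher. A confidence threshold tuned on the baseline would then defer to clinicians too rarely on the variant, which is the kind of risk the abstract describes for confidence-based workflows.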