🤖 AI Summary
Deep learning models for chest X-ray (CXR) interpretation perform unevenly across demographic subgroups and can make overconfident, erroneous predictions. Standard evaluation metrics (e.g., AUROC) and existing error-detection methods fail to expose these failures when the inputs are in-distribution. To address this, we propose a label-free augmentation-sensitivity risk scoring (ASRS) framework: it applies clinically plausible rotations (±15°/±30°), measures how far each image's RAD-DINO embedding shifts under them, and stratifies samples into stability quartiles to flag high-risk predictions. This work is the first to jointly leverage representation consistency and augmentation sensitivity for reliability assessment in medical imaging. The method uncovers overconfident failures masked by AUROC: recall on the most sensitive samples is 0.2–0.3 lower despite high AUROC and confidence. This enables selective prediction and clinician review, advancing fairness and clinical safety in medical AI.
📝 Abstract
Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
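The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings of each image and of its four rotated copies have already been computed (the rotation and the RAD-DINO forward pass are omitted). It also assumes that "embedding shift" means cosine distance averaged over the rotations, which the abstract does not specify. The function names `cosine_distance`, `sensitivity_score`, and `stability_quartiles` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity. ASSUMPTION: the abstract does not name the shift metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sensitivity_score(original: Vector, rotated: Sequence[Vector]) -> float:
    """Mean shift between an image's embedding and those of its rotated copies.

    ASSUMPTION: averaging over rotations is one plausible aggregation, not
    necessarily the one used in the paper.
    """
    return sum(cosine_distance(original, r) for r in rotated) / len(rotated)


def stability_quartiles(scores: Sequence[float]) -> List[int]:
    """Rank samples by score; return 1 (most stable) through 4 (most sensitive)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quartiles = [0] * len(scores)
    for rank, idx in enumerate(order):
        quartiles[idx] = 1 + (4 * rank) // len(scores)
    return quartiles


# Toy 2-D embeddings standing in for encoder outputs:
# (original embedding, embeddings of the four rotated copies).
samples = [
    ((1.0, 0.0), [(1.0, 0.0)] * 4),  # embedding unchanged by rotation
    ((1.0, 0.0), [(1.0, 0.1)] * 4),  # small shift
    ((1.0, 0.0), [(1.0, 0.5)] * 4),  # moderate shift
    ((1.0, 0.0), [(0.0, 1.0)] * 4),  # embedding becomes orthogonal
]
scores = [sensitivity_score(orig, rots) for orig, rots in samples]
quartiles = stability_quartiles(scores)
print(quartiles)  # [1, 2, 3, 4]
```

No labels enter the computation, which is what makes the score usable at inference time. Under this sketch, predictions in quartile 4 would be the candidates for abstention or clinician review.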