Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning models for chest X-ray (CXR) interpretation exhibit uneven performance across demographic subgroups and produce overconfident, erroneous predictions, failures that standard evaluation metrics (e.g., AUROC) and existing error-detection methods fail to expose under in-distribution conditions. To address this, we propose a label-free, augmentation-sensitivity risk scoring framework: applying clinically plausible rotational augmentations (±15°/±30°) and the RAD-DINO encoder, we quantify how far each sample's embedding shifts under augmentation and stratify samples into stability quartiles to flag high-risk predictions. This work is the first to jointly leverage representation consistency and augmentation sensitivity for reliability assessment in medical imaging. The most augmentation-sensitive samples show recall drops of 0.2–0.3 despite high AUROC and confidence, exposing overconfident failures that aggregate metrics mask and enabling selective prediction and clinician review. The method thus advances fairness and clinical safety in medical AI.

📝 Abstract
Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
Problem

Research questions and friction points this paper is trying to address.

Detecting hidden failures in chest X-ray models across subgroups
Identifying overconfident errors within standard data distributions
Improving medical AI fairness through augmentation-sensitive risk scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses augmentation-sensitivity risk scoring framework
Measures embedding shifts with RAD-DINO encoder
Applies clinical rotations for error identification
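The scoring pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `embed_fn` and `rotate_fn` are hypothetical stand-ins for the frozen RAD-DINO encoder and the clinically plausible rotation augmentation, and the cosine-distance shift measure and quartile cut points are assumptions for the sketch.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sensitivity_score(embed_fn, rotate_fn, image, angles=(-30, -15, 15, 30)):
    """Mean embedding shift between an image and its rotated variants.

    embed_fn:  frozen encoder mapping an image to a vector (RAD-DINO in the
               paper; any callable here).
    rotate_fn: rotation augmentation, called as rotate_fn(image, degrees).
    """
    z0 = embed_fn(image)
    return float(np.mean([cosine_distance(z0, embed_fn(rotate_fn(image, a)))
                          for a in angles]))

def stability_quartiles(scores):
    """Bucket samples by sensitivity: 0 = most stable, 3 = most sensitive."""
    scores = np.asarray(scores, dtype=float)
    edges = np.percentile(scores, [25, 50, 75])
    return np.searchsorted(edges, scores, side="right")
```

Samples landing in the top quartile (bucket 3) would be the candidates flagged for selective prediction or clinician review.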
Han-Jay Shu
National Tsing Hua University, Department of Computer Science
Wei-Ning Chiu
National Taiwan University, Department of Computer Science and Information Engineering
Shun-Ting Chang
National Tsing Hua University, Department of Electrical Engineering and Computer Science
Meng-Ping Huang
National Tsing Hua University, Department of Technology Management
Takeshi Tohyama
Massachusetts Institute of Technology, Laboratory for Computational Physiology
Ahram Han
Massachusetts Institute of Technology, Laboratory for Computational Physiology
Po-Chih Kuo
National Tsing Hua University
Machine learning · Medical image analysis · Biomedical signal processing