Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning models for chest X-ray (CXR) interpretation exhibit uneven performance across demographic subgroups and produce overconfident, erroneous predictions, failures that standard evaluation metrics (e.g., AUROC) and existing error-detection methods fail to expose under in-distribution conditions. To address this, we propose a label-free, augmentation-sensitivity risk scoring framework: applying clinically plausible rotational augmentations (±15°/±30°) and the RAD-DINO encoder, we quantify how far each sample's embedding shifts under augmentation and stratify samples into stability quartiles to flag high-risk predictions. This work is the first to jointly leverage representation consistency and augmentation sensitivity for reliability assessment in medical imaging. The most augmentation-sensitive samples show recall drops of 0.2–0.3 despite high AUROC and confidence, exposing overconfident failures that aggregate metrics mask and enabling selective prediction and clinician review. The method thus advances fairness and clinical safety in medical AI.

📝 Abstract
Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
Problem

Research questions and friction points this paper is trying to address.

Detecting hidden failures in chest X-ray models across subgroups
Identifying overconfident errors within standard data distributions
Improving medical AI fairness through augmentation-sensitive risk scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses augmentation-sensitivity risk scoring framework
Measures embedding shifts with RAD-DINO encoder
Applies clinical rotations for error identification
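The scoring pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `embed_fn` and `rotate_fn` are hypothetical stand-ins for the frozen RAD-DINO encoder and the clinically plausible rotation augmentation, and the cosine-distance shift measure and quartile cut points are assumptions for the sketch.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sensitivity_score(embed_fn, rotate_fn, image, angles=(-30, -15, 15, 30)):
    """Mean embedding shift between an image and its rotated variants.

    embed_fn:  frozen encoder mapping an image to a vector (RAD-DINO in the
               paper; any callable here).
    rotate_fn: rotation augmentation, called as rotate_fn(image, degrees).
    """
    z0 = embed_fn(image)
    return float(np.mean([cosine_distance(z0, embed_fn(rotate_fn(image, a)))
                          for a in angles]))

def stability_quartiles(scores):
    """Bucket samples by sensitivity: 0 = most stable, 3 = most sensitive."""
    scores = np.asarray(scores, dtype=float)
    edges = np.percentile(scores, [25, 50, 75])
    return np.searchsorted(edges, scores, side="right")
```

Samples landing in the top quartile (bucket 3) would be the candidates flagged for selective prediction or clinician review.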
Han-Jay Shu
National Tsing Hua University, Department of Computer Science
Wei-Ning Chiu
National Taiwan University, Department of Computer Science and Information Engineering
Shun-Ting Chang
National Tsing Hua University, Department of Electrical Engineering and Computer Science
Meng-Ping Huang
National Tsing Hua University, Department of Technology Management
Takeshi Tohyama
Massachusetts Institute of Technology, Laboratory for Computational Physiology
Ahram Han
Massachusetts Institute of Technology, Laboratory for Computational Physiology
Po-Chih Kuo
National Tsing Hua University
Machine learning · Medical image analysis · Biomedical signal processing