🤖 AI Summary
This study addresses the need for reliable, uncertainty-aware predictions in diabetic retinopathy screening, where a model must recognize when its output is unreliable and abstain so that the case can be reviewed clinically. It systematically investigates how the duration of self-supervised pretraining influences calibrated confidence and confidence-based abstention. Under a fixed fine-tuning protocol, multiple pretrained checkpoints are evaluated for selective prediction performance, measured by coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, self-supervised pretraining improves selective prediction compared to training from scratch; however, once accuracy plateaus, selective performance can still vary markedly across checkpoints, and longer pretraining does not consistently improve abstention behavior. The findings suggest that pretraining duration should be treated as a reliability-related design choice rather than merely a computational detail, and they underscore the importance of abstention-aware evaluation in medical AI development.
📝 Abstract
Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. Unlike prior SSL studies, which primarily evaluate downstream accuracy or AUROC, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as a reliability-related design choice rather than only a computational detail. Code is available at GitHub.
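To make the selective-prediction metrics named in the abstract concrete, the following is a minimal sketch of how coverage, selective accuracy, and selective macro-F1 could be computed under confidence-based abstention. It is not the authors' implementation. It assumes that the class probabilities are already calibrated (for example by temperature scaling), that confidence is the maximum class probability, and that macro-F1 is averaged over the classes appearing among the accepted cases. The function name `selective_metrics` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def selective_metrics(probs, labels, threshold):
    """Illustrative selective-prediction metrics (hypothetical helper).

    probs:     (n_samples, n_classes) probabilities, assumed already calibrated.
    labels:    (n_samples,) integer ground-truth classes.
    threshold: cases with max-probability confidence below this are deferred.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    confidence = probs.max(axis=1)      # assumption: confidence = max probability
    preds = probs.argmax(axis=1)
    accepted = confidence >= threshold  # everything else is deferred for review

    coverage = float(accepted.mean())   # fraction of cases the model answers
    if not accepted.any():
        nan = float("nan")
        return {"coverage": 0.0, "selective_accuracy": nan, "selective_macro_f1": nan}

    y_true, y_pred = labels[accepted], preds[accepted]
    selective_accuracy = float((y_true == y_pred).mean())

    # Macro-F1 over classes present in the accepted labels or predictions
    # (an assumption; other averaging conventions are possible).
    f1_scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))

    return {
        "coverage": coverage,
        "selective_accuracy": selective_accuracy,
        "selective_macro_f1": float(np.mean(f1_scores)),
    }
```

Sweeping `threshold` from 0 to 1 for each pretraining checkpoint would trace out a coverage versus selective-performance curve. Comparing such curves is one way to see how two checkpoints with similar full-coverage accuracy might still differ in abstention behavior.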