🤖 AI Summary
This study addresses clinical target volume (CTV) delineation, which is time-consuming and hard to quality-check in complex radiotherapy procedures such as total marrow and lymph node irradiation (TMLI), where safe use of auto-segmentation depends on reliable cues about where a model may be wrong. The authors propose a budget-aware, uncertainty-driven quality assurance (QA) framework built on nnU-Net that combines post-hoc temperature scaling with ensembling strategies (deep ensembles, checkpoint ensembles, and test-time augmentation). Voxel-wise uncertainty maps are derived from predictive entropy, and reliability is evaluated with ROI-masked calibration metrics and an AUC computed over the top 0-5% most uncertain voxels. Segmentation accuracy remains stable across configurations, temperature scaling substantially improves calibration, and calibrated checkpoint-based inference gives the best alignment between uncertainty and actual segmentation errors, yielding uncertainty maps that more consistently highlight regions needing manual edits.
📝 Abstract
Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet it remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware, uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination, with TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as the AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling appears to be a promising strategy for implementing a budget-aware QA workflow for radiotherapy segmentation.
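The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(C, ...)` array layout, and above all the exact form of the budget-limited AUC (fraction of erroneous voxels captured versus fraction of ROI voxels reviewed, normalized over a 0-5% budget) are assumptions made for illustration, since the abstract does not spell them out.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the class axis (axis 0) of (C, ...) logits."""
    z = logits / temperature
    z = z - z.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def predictive_entropy(probs, eps=1e-12):
    """Voxel-wise entropy of class probabilities shaped (C, ...)."""
    return -(probs * np.log(probs + eps)).sum(axis=0)

def ensemble_entropy(member_logits, temperature=1.0):
    """Average calibrated probabilities over ensemble members (e.g. checkpoints),
    then compute the entropy of the averaged prediction."""
    mean_probs = np.mean([softmax(l, temperature) for l in member_logits], axis=0)
    return mean_probs, predictive_entropy(mean_probs)

def budget_auc(uncertainty, errors, roi, max_budget=0.05, steps=51):
    """Assumed metric: normalized area under the curve of
    (fraction of errors captured) vs. (fraction of ROI voxels reviewed),
    reviewing the most uncertain voxels first, for budgets in [0, max_budget]."""
    u = uncertainty[roi]
    e = errors[roi].astype(float)
    order = np.argsort(-u, kind="stable")            # most uncertain first
    cum_errors = np.concatenate([[0.0], np.cumsum(e[order])])
    budgets = np.linspace(0.0, max_budget, steps)
    k = np.rint(budgets * u.size).astype(int)        # voxels reviewed per budget
    total = e.sum()
    captured = cum_errors[k] / total if total > 0 else np.zeros_like(budgets)
    # Trapezoidal rule, normalized so a curve that is 1 everywhere gives 1.
    area = np.sum((captured[1:] + captured[:-1]) / 2.0 * np.diff(budgets))
    return area / max_budget

# Hypothetical usage with random data standing in for two checkpoint outputs.
rng = np.random.default_rng(0)
members = [rng.normal(size=(3, 8, 8, 8)) for _ in range(2)]
labels = rng.integers(0, 3, size=(8, 8, 8))
probs, entropy = ensemble_entropy(members, temperature=1.5)
errors = probs.argmax(axis=0) != labels
roi = np.ones_like(labels, dtype=bool)
score = budget_auc(entropy, errors, roi)
```

In practice the temperature would be fitted on held-out data (for instance by minimizing the negative log-likelihood) rather than fixed by hand, and the ROI mask would restrict evaluation to the anatomically relevant region.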