🤖 AI Summary
This study systematically evaluates, for the first time, the generalizability of three publicly available deep learning models (Unet-R231, TotalSegmentator, and MedSAM) for lung segmentation in lung transplant candidates, with emphasis on disease severity, pathological subtypes (e.g., COPD, ILD, CF), and inter-lung asymmetry. Performance is assessed using the Dice similarity coefficient, volumetric similarity, Hausdorff distance, and a four-point clinician-rated acceptability scale. Unet-R231 achieves the best overall performance; however, all models degrade significantly in moderate-to-severe cases, particularly in volumetric similarity (>15% reduction), which limits their reliability for preoperative quantitative assessment. The key contribution is the empirical demonstration that current general-purpose models lack robustness to severe pulmonary structural deformation. This underscores the need for pathology- and severity-informed model optimization using high-quality, multi-pathology, pre-transplant CT data, and provides evidence relevant to clinically viable AI deployment in thoracic transplantation.
📝 Abstract
This study evaluates publicly available deep learning-based lung segmentation models in transplant-eligible patients to determine their performance across disease severity levels, pathology categories, and lung sides, and to identify limitations affecting their use in preoperative planning for lung transplantation. This retrospective study included 32 patients who underwent chest CT scans at Duke University Health System between 2017 and 2019 (a total of 3,645 2D axial slices). Patients with standard axial CT scans were selected based on the presence of two or more lung pathologies of varying severity. Lung segmentation was performed using three previously developed deep learning models: Unet-R231, TotalSegmentator, and MedSAM. Performance was assessed using quantitative metrics (volumetric similarity, Dice similarity coefficient, Hausdorff distance) and a qualitative measure (a four-point clinical acceptability scale). Unet-R231 consistently outperformed TotalSegmentator and MedSAM overall and across severity levels and pathology categories (p<0.05). All models showed significant performance declines from mild to moderate-to-severe cases, particularly in volumetric similarity (p<0.05), with no significant differences between lung sides or among pathology types. Unet-R231 provided the most accurate automated lung segmentation among the evaluated models, with TotalSegmentator a close second, although the performance of both declined significantly in moderate-to-severe cases, emphasizing the need for specialized model fine-tuning in severe pathology contexts.
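The three quantitative metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formulas or software the authors used, so the definitions below are commonly used ones and should be read as assumptions: Dice as 2|A∩B| / (|A| + |B|), volumetric similarity as 1 − ||A| − |B|| / (|A| + |B|), and the symmetric Hausdorff distance computed by brute force over all mask voxels. The function names, the representation of masks as sets of `(z, y, x)` voxel indices, and the `spacing` parameter are hypothetical choices made for illustration.

```python
import math


def dice(pred, ref):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not pred and not ref:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def volumetric_similarity(pred, ref):
    """1 - |V_pred - V_ref| / (V_pred + V_ref); ignores spatial overlap."""
    if not pred and not ref:
        return 1.0  # same empty-mask convention as above
    return 1 - abs(len(pred) - len(ref)) / (len(pred) + len(ref))


def hausdorff(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance in physical units (brute force).

    `spacing` is the assumed voxel size along (z, y, x). The cost is
    O(|pred| * |ref|), so this is only practical for small masks.
    """
    if not pred or not ref:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def to_physical(voxels):
        return [tuple(c * s for c, s in zip(v, spacing)) for v in voxels]

    p, r = to_physical(pred), to_physical(ref)

    def directed(a, b):
        # Largest distance from any point in a to its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, r), directed(r, p))


# Toy example: the prediction misses one voxel of a four-voxel reference.
ref_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}

print(dice(pred_mask, ref_mask))
print(volumetric_similarity(pred_mask, ref_mask))
print(hausdorff(pred_mask, ref_mask))
```

Because volumetric similarity depends only on the two volumes, it can stay high when a mask is misplaced and drop sharply when lung tissue is missed. The second behavior is consistent with the abstract's observation that this metric declined most in moderate-to-severe cases, where missed tissue is one plausible cause.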