🤖 AI Summary
For rectal cancer patients achieving clinical complete response (cCR) after neoadjuvant therapy, the watch-and-wait (WW) strategy is increasingly adopted; however, early, objective, and accurate detection of local regrowth (LR) during endoscopic surveillance remains a critical unmet need.
Method: We propose a registration-free dual-phase endoscopic image analysis framework: a Siamese network built upon a pretrained Swin Transformer, augmented with a novel dual cross-attention mechanism to enhance inter-phase feature interaction, and integrated with longitudinal contrastive learning.
Contribution/Results: Evaluated on 62 patient cases, our model achieves 81.76% balanced accuracy, 90.07% sensitivity, and 72.86% specificity for LR detection. Feature clustering demonstrates strong discriminative capability, and the model exhibits robustness against common endoscopic artifacts. The approach delivers interpretable, highly robust AI-assisted decision support for precise dynamic monitoring in WW management.
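For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and balanced accuracy are conventionally computed for a binary classifier. It assumes LR is treated as the positive class (label 1) and cCR as the negative class (label 0), which the text does not state explicitly; the function name `binary_metrics` and the label lists are hypothetical and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, balanced accuracy) for 0/1 labels.

    Assumption for illustration: 1 = local regrowth (LR), 0 = cCR.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # LR correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # LR missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # cCR correctly kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # cCR falsely flagged

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Balanced accuracy is the unweighted mean of the two per-class recalls,
    # so it is not inflated when one class dominates the test set.
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Made-up labels, purely illustrative (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec, bal_acc = binary_metrics(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} balanced={bal_acc:.3f}")
```

Balanced accuracy is a natural headline metric in a surveillance setting like this one, where regrowth cases and sustained responses are unlikely to be equally frequent.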
📝 Abstract
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) that combines longitudinal endoscopic images acquired at restaging and follow-up to distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the two scans and predict response without requiring any spatial alignment of the images. SSDCA and Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% $\pm$ 0.04), sensitivity (90.07% $\pm$ 0.08), and specificity (72.86% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
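To make the idea of alignment-free dual cross-attention concrete, here is a minimal NumPy sketch under stated assumptions. It is not the SSDCA implementation: the abstract does not specify the layer design, so this sketch uses plain scaled dot-product attention without learned projections, random arrays standing in for Swin token features, and mean pooling followed by concatenation as the fusion step. All function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention in which `queries` attend to `context`.

    queries: (n_q, d) token features from one image
    context: (n_c, d) token features from the other image
    Returns the attended features (n_q, d) and the weights (n_q, n_c).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


def dual_cross_attention(feat_restaging, feat_followup):
    """Attend in both directions, then pool and concatenate (an assumed fusion)."""
    rest_to_follow, _ = cross_attention(feat_restaging, feat_followup)
    follow_to_rest, _ = cross_attention(feat_followup, feat_restaging)
    return np.concatenate([rest_to_follow.mean(axis=0),
                           follow_to_rest.mean(axis=0)])


# Random stand-ins for backbone token features (hypothetical shapes).
rng = np.random.default_rng(0)
restaging_tokens = rng.normal(size=(49, 32))
followup_tokens = rng.normal(size=(36, 32))  # a different token count is fine

fused = dual_cross_attention(restaging_tokens, followup_tokens)
print(fused.shape)
```

Because each token attends over every token of the other image, no pixel-wise correspondence between the restaging and follow-up views is needed, which is the property the abstract refers to as not requiring spatial alignment. In a Siamese setup the same backbone weights would produce both token sets; that sharing is outside the scope of this sketch.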