🤖 AI Summary
Bladder cancer exhibits a high postoperative recurrence rate (up to 78%), yet interpreting multiparametric contrast-enhanced MRI remains challenging due to post-surgical confounders such as fibrosis and edema, and no dedicated, publicly available dataset exists for AI-driven recurrence prediction. To address these limitations, we propose a hierarchical gated-attention multi-branch network that leverages parallel CNN and Vision Transformer (ViT) pathways to model each MRI sequence independently, followed by context-aware dynamic weighting for global-local feature fusion. We also introduce and publicly release the first multimodal MRI dataset specifically designed for bladder cancer recurrence prediction. Evaluated on this dataset, our model achieves an AUC of 78.6%, outperforming existing methods. The source code and dataset are fully open-sourced, and the model provides clinically interpretable attention maps, demonstrating strong potential for clinical deployment.
📝 Abstract
Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that selectively weights features from the global (ViT) and local (CNN) paths based on contextual demands, achieving balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
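The gated global-local fusion described above can be illustrated with a minimal NumPy sketch. This is an assumption about one common form of gating, not the authors' actual implementation: a sigmoid gate, computed from the concatenated branch features, forms a per-dimension convex combination of the local (CNN) and global (ViT) features. All names (`gated_fusion`, `f_cnn`, `f_vit`, `W`, `b`) and the feature dimension are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_local, f_global, W, b):
    """Context-aware gate weighting local (CNN) vs. global (ViT) features.

    f_local, f_global: (d,) feature vectors from the two branches.
    W: (d, 2d) projection and b: (d,) bias -- learned in practice,
    random here purely for illustration.
    """
    ctx = np.concatenate([f_local, f_global])   # joint context of both branches
    g = sigmoid(W @ ctx + b)                    # per-dimension gate in (0, 1)
    return g * f_local + (1.0 - g) * f_global   # convex combination per dimension

rng = np.random.default_rng(0)
d = 8
f_cnn, f_vit = rng.normal(size=d), rng.normal(size=d)
W, b = 0.1 * rng.normal(size=(d, 2 * d)), np.zeros(d)
fused = gated_fusion(f_cnn, f_vit, W, b)
print(fused.shape)  # (8,)
```

Because the gate lies in (0, 1), each fused component is guaranteed to stay between the corresponding CNN and ViT feature values, so neither branch can be entirely discarded; in the full model such gates would be applied per modality before the final prediction head.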