🤖 AI Summary
This work addresses the reliability of clinical deep learning systems in detecting out-of-distribution (OOD) samples, such as unseen disease cases. It proposes a dual-branch multimodal framework that, for the first time, integrates text-image matching with purely visual representations. The OOD score is obtained by fusing the complementary scores of the two branches through a flexible score-fusion strategy, and the framework is compatible with various backbone architectures. Experiments on multiple public endoscopic image datasets show up to a 24.84% improvement in OOD detection performance over current state-of-the-art methods, with consistent robustness across different backbones.
📝 Abstract
The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when they encounter data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, and so fail to fully leverage multimodal information. To address this limitation, we propose a novel dual-branch multimodal framework consisting of a text-image branch and a vision branch. The framework exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute a score from the text-image branch ($S_t$) and a score from the vision branch ($S_v$), and integrate them to obtain the final OOD score $S$, which is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that the proposed framework is robust across diverse backbones and improves state-of-the-art OOD detection performance by up to 24.84%.
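The scoring step described in the abstract (combine $S_t$ and $S_v$ into $S$, then compare $S$ with a threshold) can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the convex combination below, the weight `alpha`, the function names `fuse_scores` and `detect_ood`, and the convention that a higher score means "more likely OOD" are all assumptions made for illustration only, not the authors' method.

```python
from typing import List, Sequence


def fuse_scores(s_t: Sequence[float], s_v: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Fuse per-sample branch scores into a single OOD score S.

    ASSUMPTION: a convex combination S = alpha * S_t + (1 - alpha) * S_v.
    The abstract only says the two scores are "integrated"; the actual
    rule used in the paper may differ.
    """
    if len(s_t) != len(s_v):
        raise ValueError("s_t and s_v must have the same number of samples")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * v for t, v in zip(s_t, s_v)]


def detect_ood(scores: Sequence[float], threshold: float) -> List[bool]:
    """Flag a sample as OOD when its fused score exceeds the threshold.

    ASSUMPTION: higher scores indicate OOD samples. If the score is a
    confidence-style measure instead, the comparison should be flipped.
    """
    return [s > threshold for s in scores]


if __name__ == "__main__":
    # Hypothetical scores for three images, one list per branch.
    s_t = [0.1, 0.8, 0.4]  # text-image branch
    s_v = [0.2, 0.9, 0.7]  # vision branch
    fused = fuse_scores(s_t, s_v, alpha=0.5)
    print(detect_ood(fused, threshold=0.5))
```

In practice the threshold is usually chosen on held-out in-distribution data, for example so that a fixed fraction of in-distribution samples is accepted; whether the paper does this is not stated in the abstract.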