🤖 AI Summary
Medical vision-language models (VLMs) often struggle to model the relationship between pathology (what a disease is) and anatomy (where the lesion occurs), because these two semantic dimensions are heavily entangled in medical data. To address this, MeDSLIP proposes a dual-stream pretraining framework that disentangles pathology and anatomy semantics: two separate streams independently encode disease-relevant and location-relevant information and align visual and textual features within each stream, while prototypical contrastive learning and an intra-image contrastive loss regularize the structured interdependencies between the two streams, enabling fine-grained cross-modal semantic alignment. Evaluated on four chest X-ray benchmarks (NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, and COVIDx CXR-4), the method improves zero-shot transfer and downstream task performance, demonstrating strong generalizability and transferability.
📝 Abstract
Pathology and anatomy are two essential groups of semantics in medical data. Pathology describes what the diseases are, while anatomy explains where the diseases occur. They describe diseases from different perspectives, providing complementary insights. Thus, properly understanding these semantics and their relationships can enhance medical vision-language models (VLMs). However, pathology and anatomy semantics are usually entangled in medical data, hindering VLMs from explicitly modeling these semantics and their relationships. To address this challenge, we propose MeDSLIP, a novel Medical Dual-Stream Language-Image Pre-training pipeline, to disentangle pathology and anatomy semantics and model the relationships between them. We introduce a dual-stream mechanism in MeDSLIP to explicitly disentangle medical semantics into pathology-relevant and anatomy-relevant streams and align visual and textual information within each stream. Furthermore, we propose an interaction modeling module with prototypical contrastive learning loss and intra-image contrastive learning loss to regularize the relationships between pathology and anatomy semantics. We apply MeDSLIP to chest X-ray analysis and conduct comprehensive evaluations with four benchmark datasets: NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, and COVIDx CXR-4. The results demonstrate MeDSLIP's superior generalizability and transferability across different scenarios. The code is available at https://github.com/Shef-AIRE/MeDSLIP, and the pre-trained model is released at https://huggingface.co/pykale/MeDSLIP.
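To make the loss structure concrete, here is a minimal, hedged sketch (not the authors' released code) of two of the ideas described above: per-stream vision-language alignment, where pathology and anatomy embeddings are each contrasted against their matching text, and an intra-image contrastive term tying together the pathology and anatomy views of the same image. The prototypical contrastive component is omitted for brevity, and all tensor shapes and variable names are illustrative assumptions, using random NumPy arrays in place of real encoder outputs.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """InfoNCE-style loss: row i of `a` should match row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature              # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(np.diag(probs))))  # positives on the diagonal

rng = np.random.default_rng(0)
N, D = 8, 32                                      # batch size, embedding dim (illustrative)
img_path, txt_path = rng.normal(size=(N, D)), rng.normal(size=(N, D))
img_anat, txt_anat = rng.normal(size=(N, D)), rng.normal(size=(N, D))

# Per-stream vision-language alignment (pathology stream + anatomy stream).
align_loss = 0.5 * (info_nce(img_path, txt_path) + info_nce(img_anat, txt_anat))
# Intra-image term: pathology vs. anatomy embeddings of the same image.
intra_loss = info_nce(img_path, img_anat)
total_loss = align_loss + intra_loss
print(f"align={align_loss:.3f}  intra={intra_loss:.3f}  total={total_loss:.3f}")
```

In a real pipeline the random arrays would be replaced by the outputs of the pathology and anatomy encoder streams, and the intra-image term would be weighted against the alignment term as a training hyperparameter.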