🤖 AI Summary
Existing medical vision-language contrastive learning methods aggregate local matchings with simplistic pooling, neglecting both semantic associations (e.g., clinical correspondences between disease terms and anatomical location terms) and token-level importance disparities (e.g., content words versus conjunctions). To address this, we propose the Relation-Enhanced Contrastive Learning Framework (RECLF), the first to explicitly model the semantic relations and importance weighting among local matchings between image sub-regions and report keywords. RECLF introduces two novel modules: a semantic-relation reasoning module (SRM) and an importance-relation reasoning module (IRM). By moving beyond conventional pooling that ignores relational structure, RECLF achieves state-of-the-art performance across six public benchmarks and four downstream tasks—zero-shot classification, linear probe evaluation, cross-modal retrieval, and segmentation—demonstrating consistent improvements in weakly supervised cross-modal alignment quality and representation generalizability.
📝 Abstract
Medical image representations can be learned through medical vision-language contrastive learning (mVLCL), where medical imaging reports are used as weak supervision through image-text alignment. These learned image representations can be transferred to and benefit various downstream medical vision tasks such as disease classification and segmentation. Recent mVLCL methods attempt to align image sub-regions with report keywords as local-matchings. However, these methods aggregate all local-matchings via simple pooling operations while ignoring the inherent relations between them. These methods therefore fail to reason between local-matchings that are semantically related, e.g., local-matchings that correspond to the disease word and the location word (semantic-relations), and also fail to differentiate such clinically important local-matchings from others that correspond to less meaningful words, e.g., conjunctions (importance-relations). Hence, we propose an mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF). In RECLF, we introduce a semantic-relation reasoning module (SRM) and an importance-relation reasoning module (IRM) to enable more fine-grained report supervision for image representation learning. We evaluated our method using six public benchmark datasets on four downstream tasks: segmentation, zero-shot classification, linear classification, and cross-modal retrieval. Our results demonstrated the superiority of our RECLF over state-of-the-art mVLCL methods, with consistent improvements across single-modal and cross-modal tasks. These results suggest that our RECLF, by modelling the inter-matching relations, can learn improved medical image representations with better generalization capabilities.
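To make the pooling-versus-weighting contrast concrete, below is a minimal NumPy sketch of local matching between image sub-region embeddings and report word embeddings. All names, dimensions, and importance weights are illustrative assumptions for exposition, not the paper's actual SRM/IRM implementations; the fixed weight vector merely mimics the effect of downweighting conjunctions.

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length for cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Hypothetical embeddings: 6 image sub-regions, 5 report words, dim 32
regions = l2norm(rng.standard_normal((6, 32)))
words = l2norm(rng.standard_normal((5, 32)))

# Local matchings: each report word matched to its best-aligned sub-region
sim = words @ regions.T          # (5, 6) word-region cosine similarities
local = sim.max(axis=1)          # (5,) best matching score per word

# Simple pooling (prior mVLCL methods): every word weighted equally,
# so a conjunction contributes as much as a disease term
score_pooled = local.mean()

# Importance-weighted aggregation (IRM-style, mocked): clinically
# meaningful words get higher weight; e.g., for the report fragment
# "opacity", "left", "and", "lobe", "is"
importance = softmax(np.array([2.0, 1.5, 0.1, 1.8, 0.1]))
score_weighted = importance @ local
```

In an actual framework the importance weights would be predicted from the token representations (and semantic relations reasoned between matchings), rather than fixed as here; the sketch only shows why relation-aware aggregation can differ from uniform pooling.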