🤖 AI Summary
To address the lack of high-quality foundation models for biomedical multimodal understanding, this paper introduces BiomedCLIP, a large-scale, general-purpose biomedical vision-language model pretrained on PMC-15M, a newly curated dataset of 15 million diverse image-text pairs collected from 4.4 million scientific articles. Methodologically, the authors adapt the CLIP architecture to the biomedical domain, pairing a Vision Transformer (ViT) image encoder with a text encoder under a contrastive learning objective, with domain-specific adaptations for biomedical vision-language processing. The contributions are threefold: (1) large-scale unified representation learning for biomedical images and text; (2) new state-of-the-art results across diverse benchmarks, including image retrieval, classification, and visual question answering, along with better performance than radiology-specific models (e.g., BioViL) on radiology tasks such as RSNA pneumonia detection; and (3) an open-access release of the model. The work moves beyond modality-specific paradigms toward a scalable foundation for cross-modal biomedical AI.
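The contrastive objective mentioned above can be illustrated with a minimal sketch of the symmetric, CLIP-style loss over a batch of paired image and text embeddings. This is an illustration only, not the authors' training code: the function names, the pure-Python list representation, and the fixed temperature of 0.07 are assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-to-text and text-to-image similarities.

    image_embs[k] and text_embs[k] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    The temperature value is an assumption for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image: same idea over the columns of the logit matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch, which is why a large and diverse set of image-text pairs matters for this style of pretraining.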
📝 Abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results on a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL on radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
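Cross-modal retrieval, one of the task families named in the abstract, is commonly scored with Recall@K: the fraction of text queries whose paired image appears among the K most similar images. The sketch below shows that metric under stated assumptions. The function names and the convention that query `i` is paired with image `i` are hypothetical choices for the example, and the paper's exact evaluation protocol is not specified here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(text_embs, image_embs, k):
    """Fraction of text queries whose paired image ranks in the top k.

    Assumes text_embs[i] is paired with image_embs[i].
    """
    hits = 0
    for idx, text in enumerate(text_embs):
        scores = [cosine(text, img) for img in image_embs]
        # Image indices sorted from most to least similar to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if idx in ranked[:k]:
            hits += 1
    return hits / len(text_embs)
```

In practice the embeddings would come from the pretrained text and image encoders; here they are plain lists of floats so that the metric itself stays easy to inspect.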