BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Progress toward generalist biomedical vision-language models (VLMs) is limited by the lack of annotated, publicly accessible multimodal datasets spanning biology and medicine; existing efforts cover only narrow domains. Method: We introduce BIOMEDICA, a scalable, open-source framework that extracts, annotates, and serializes the PubMed Central Open Access subset into an archive of over 24 million unique image-caption pairs drawn from over 6 million articles, accompanied by metadata and expert-guided annotations. We further release BMCA-CLIP, a family of CLIP-style models continuously pre-trained on the dataset via streaming, which eliminates the need to download 27 TB of data locally. Results: On average, BMCA-CLIP achieves state-of-the-art performance across 40 biomedical tasks: zero-shot classification improves by 6.56% on average (up to 29.8% in dermatology and 17.5% in ophthalmology), image-text retrieval is stronger, and the models use 10x less compute.
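The streaming idea mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' pipeline: `fake_shards`, `stream_pairs`, and `batched` are hypothetical names, and the shards here are generated in memory as a stand-in for remote files. The point is only that batches can be formed lazily, one shard at a time, without first materializing the whole dataset.

```python
from itertools import islice


def fake_shards(n_shards, pairs_per_shard):
    """Hypothetical stand-in for remote shards; each shard is produced lazily."""
    for s in range(n_shards):
        yield ((f"image_{s}_{i}", f"caption_{s}_{i}") for i in range(pairs_per_shard))


def stream_pairs(shards):
    """Yield (image, caption) pairs one at a time, shard by shard."""
    for shard in shards:
        for pair in shard:
            yield pair


def batched(stream, batch_size):
    """Group a stream into lists of at most batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is held in memory at a time while iterating.
batches = list(batched(stream_pairs(fake_shards(3, 5)), batch_size=4))
```

In a real setting the shard generator would read from remote storage, and each batch would be passed to a training step instead of being collected into a list.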

📝 Abstract
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
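For readers unfamiliar with how a CLIP-style model performs zero-shot classification, the following is a minimal sketch of the general technique, not the BMCA-CLIP code. It assumes embeddings have already been produced by some image and text encoder; the toy vectors, the function names, and the `temperature` value of 100 are illustrative assumptions. Each class is represented by the embedding of a text prompt, and the image is assigned to the class whose prompt embedding has the highest cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_text_embs, temperature=100.0):
    """Return (predicted class index, class probabilities) for one image.

    image_emb:       embedding of the image (list of floats).
    class_text_embs: one prompt embedding per candidate class.
    temperature:     assumed logit scale applied before the softmax.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in class_text_embs:
        txt = l2_normalize(text_emb)
        logits.append(temperature * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the class logits.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# Toy 3-dimensional embeddings for two hypothetical class prompts.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred, probs = zero_shot_classify([0.9, 0.1, 0.0], class_text_embs)
```

No task-specific training is involved: changing the set of candidate classes only requires embedding a different set of prompts.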
Problem

Research questions and friction points this paper is trying to address.

Biomedical Dataset
Vision-Language Models
Dataset Accessibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

BIOMEDICA
BMCA-CLIP
Medical Image Recognition
Alejandro Lozano
Stanford University
Foundation Models, Multimodal Learning, Retrieval Augmentation
Min Woo Sun
Stanford University
Machine Learning, Multimodal Learning, Statistics
James Burgess
Stanford University
Liangyu Chen
Department of Computer Science, Stanford University
Jeffrey J. Nirschl
Department of Pathology, Stanford University
Jeffrey Gu
PhD candidate, Stanford University
Ivan Lopez
Stanford University
data science, machine learning, NLP, health systems, clinical decision support
Josiah Aklilu
PhD student, Stanford University
Artificial Intelligence, Computer vision
Anita Rau
Postdoc at Stanford University
Computer Vision, Machine Learning
Austin Wolfgang Katzer
Department of Developmental Biology, Stanford University
Yuhui Zhang
Stanford University
Machine Learning, Computer Vision, Natural Language Processing, Biotech
Xiaohan Wang
Department of Biomedical Data Science, Stanford University
R. Tibshirani
Department of Biomedical Data Science, Department of Statistics, Stanford University
S. Yeung-Levy
Department of Biomedical Data Science, Department of Electrical Engineering, Department of Computer Science, Stanford University