🤖 AI Summary
To address the scarcity of large-scale, high-fidelity biomedical image-text alignment data, this paper introduces an automated, large-scale subfigure parsing pipeline for compound medical figures. A scalable transformer-based object detection model precisely extracts subfigures and pairs them with their corresponding clinical captions, yielding 18 million multimodal (radiological, microscopic, visible-light) image-text pairs. By combining structured parsing of PMC literature, synthetic data augmentation, and high-precision image-text alignment cleaning, the authors construct a unified, high-quality biomedical image-text dataset that fills a critical gap in the field. The extraction pipeline achieves state-of-the-art performance on benchmarks including ImageCLEF 2016, and vision-language models trained on the dataset significantly outperform existing methods across cross-modal retrieval, zero-shot classification, and robustness evaluation.
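The extraction step can be pictured as ordinary transformer-based object detection applied to compound figures. The sketch below is illustrative only: the paper's detector is fine-tuned on its synthetic compound-figure corpus, whereas this code uses the public `facebook/detr-resnet-50` checkpoint as a stand-in, and the `extract_subfigures` helper is a hypothetical name, not the released API.

```python
# Minimal sketch: subfigure extraction via a DETR-style detector.
# Assumption: the paper fine-tunes a transformer detector on synthetic
# compound figures; "facebook/detr-resnet-50" here is a generic stand-in.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

def extract_subfigures(path, threshold=0.7):
    """Detect subfigure panels in a compound figure and return the crops."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert predicted boxes to pixel coordinates on the original image.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return [image.crop(tuple(box.tolist())) for box in results["boxes"]]
```

Each crop would then be matched to its caption fragment during the image-text alignment cleaning stage described above.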
📝 Abstract
Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline built on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, which achieves state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale, high-quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible-light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarking and further study of biomedical vision-language modeling and representation learning.
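The synthetic training corpus can be understood through a minimal sketch of compound-figure synthesis: tiling single-panel images into a grid and recording the paste coordinates as ground-truth detection boxes. The `make_compound` helper below is hypothetical and assumes a simple uniform grid; the paper's 500,000-figure corpus may use richer layouts, labels, and augmentations.

```python
# Minimal sketch of synthetic compound-figure generation, assuming the
# corpus is built by tiling single panels and keeping the paste boxes
# as detector training labels. Layout details here are illustrative.
from PIL import Image

def make_compound(panels, cols=2, panel_size=(224, 224), pad=8):
    """Paste single-panel images into a grid; return the composite image
    and the ground-truth subfigure boxes as (x0, y0, x1, y1) tuples."""
    rows = -(-len(panels) // cols)  # ceiling division
    w, h = panel_size
    canvas = Image.new(
        "RGB", (cols * (w + pad) + pad, rows * (h + pad) + pad), "white"
    )
    boxes = []
    for i, panel in enumerate(panels):
        x = pad + (i % cols) * (w + pad)
        y = pad + (i // cols) * (h + pad)
        canvas.paste(panel.resize(panel_size), (x, y))
        boxes.append((x, y, x + w, y + h))
    return canvas, boxes
```

Pairs of (composite, boxes) produced this way serve as supervision for the detector, which is then evaluated against real compound figures such as those in ImageCLEF 2016.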