🤖 AI Summary
Biomedical AI development is hindered by the scarcity of high-quality, large-scale, multimodal data. To address this, the authors introduce Biomedica, a large-scale biomedical multimodal dataset constructed from the open-access PubMed Central literature corpus, comprising 6 million papers, 24 million image-text pairs, and 27 metadata fields (including expert-annotated labels). Biomedica ships with a scalable streaming API and a search service, designed to support training embedding models, chat-style models, and retrieval-augmented generation (RAG) agents. Models trained on Biomedica consistently outperform existing open-source baselines on embedding, dialogue, and RAG tasks, underscoring the critical role of high-quality multimodal data in advancing general-purpose biomedical AI.
📝 Abstract
Despite the excitement around biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data, the foundation of modern AI systems, is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To ease access to this large-scale dataset, we provide scalable streaming and search APIs through a web server, enabling seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.
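The abstract mentions scalable streaming APIs for consuming the 24 million image-text pairs without downloading the full corpus, but does not specify their interface. As a rough illustration of the lazy, paginated-consumption pattern such an API typically enables, here is a minimal self-contained sketch; the endpoint shape, field names, and record schema are all hypothetical, not the actual Biomedica API.

```python
# Minimal sketch of lazily streaming records from a paginated API.
# All names (fetch_page, pmc_id, caption) are illustrative assumptions;
# they are NOT taken from the Biomedica documentation.
from typing import Callable, Dict, Iterator, List

def stream_pairs(fetch_page: Callable[[int, int], List[Dict]],
                 page_size: int = 2) -> Iterator[Dict]:
    """Yield image-text records one at a time, fetching pages on demand,
    so the client never holds more than one page in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:  # empty page signals the end of the stream
            return
        yield from page
        offset += len(page)

# Stand-in for an HTTP call such as GET /pairs?offset=..&limit=..
_FAKE_DB = [{"pmc_id": f"PMC{i}", "caption": f"Figure {i}"} for i in range(5)]

def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return _FAKE_DB[offset:offset + limit]

records = list(stream_pairs(fake_fetch))
```

In practice the generator would be fed directly into a training loop or a retrieval index builder, so the full dataset never needs to reside on local disk.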