🤖 AI Summary
Biomedical AI development is hindered by the scarcity of high-quality, large-scale, multimodal data. To address this, the authors introduce Biomedica, a large-scale biomedical multimodal dataset constructed from the open-access PubMed Central literature corpus, comprising 6 million papers, 24 million image-text pairs, and 27 metadata fields (including expert-annotated labels). Biomedica ships with a scalable streaming API and a search service, designed to support training embedding models, chat-style models, and retrieval-augmented generation (RAG) agents. Models trained on Biomedica consistently outperform existing open-source baselines on embedding, dialogue, and RAG tasks, underscoring the critical role of high-quality multimodal data in advancing general-purpose biomedical AI.
📝 Abstract
Despite the excitement around biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data, the foundation of modern AI systems, is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To ease access to this large-scale dataset, we provide scalable streaming and search APIs through a web server, enabling seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.
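The abstract mentions scalable streaming APIs for consuming the 24 million image-text pairs without downloading the full corpus, but does not specify their interface. As a rough illustration of the lazy, paginated-consumption pattern such an API typically enables, here is a minimal self-contained sketch; the endpoint shape, field names, and record schema are all hypothetical, not the actual Biomedica API.

```python
# Minimal sketch of lazily streaming records from a paginated API.
# All names (fetch_page, pmc_id, caption) are illustrative assumptions;
# they are NOT taken from the Biomedica documentation.
from typing import Callable, Dict, Iterator, List

def stream_pairs(fetch_page: Callable[[int, int], List[Dict]],
                 page_size: int = 2) -> Iterator[Dict]:
    """Yield image-text records one at a time, fetching pages on demand,
    so the client never holds more than one page in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:  # empty page signals the end of the stream
            return
        yield from page
        offset += len(page)

# Stand-in for an HTTP call such as GET /pairs?offset=..&limit=..
_FAKE_DB = [{"pmc_id": f"PMC{i}", "caption": f"Figure {i}"} for i in range(5)]

def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return _FAKE_DB[offset:offset + limit]

records = list(stream_pairs(fake_fetch))
```

In practice the generator would be fed directly into a training loop or a retrieval index builder, so the full dataset never needs to reside on local disk.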