PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of large-scale, high-quality image-text data for ophthalmic vision-language research by presenting an automated pipeline that builds a hierarchical multimodal ophthalmology dataset from open-access PubMed Central article PDFs. The pipeline extracts figures at full resolution, decomposes them into panels and individual images, classifies each image's imaging modality, flags annotation marks such as arrows, and uses a two-step LLM approach to split figure captions into panel-level subcaptions. The resulting dataset contains 102,023 image-caption pairs drawn from 15,842 articles. Caption splitting reaches a mean sentence BLEU score of 0.913 on human-annotated data; panel and image detection reach a mAP@0.50 of 0.909 and 0.892, respectively; and figure extraction achieves a median IoU of 0.997. The authors release the dataset, the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
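The hierarchy described above (figure, then panels, then individual images with modality and mark labels) can be pictured as nested records. The following is a minimal sketch only: every class and field name is hypothetical, and the released dataset's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names here are hypothetical illustrations, not the released schema.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Modality labels as listed in the abstract.
MODALITIES = {
    "color fundus photography",
    "optical coherence tomography",
    "retinal imaging",
    "other",
}


@dataclass
class Image:
    bbox: BBox
    modality: str
    has_marks: bool  # True if annotation marks such as arrows are present

    def __post_init__(self) -> None:
        # Reject labels outside the four modality classes.
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")


@dataclass
class Panel:
    bbox: BBox
    identifier: Optional[str]  # panel label such as "A"; None if absent
    subcaption: str            # panel-level part of the figure caption
    images: List[Image] = field(default_factory=list)


@dataclass
class Figure:
    article_id: str
    caption: str
    panels: List[Panel] = field(default_factory=list)

    def image_caption_pairs(self) -> List[Tuple[Image, str]]:
        # Pair each image with the subcaption of the panel containing it.
        return [(img, p.subcaption) for p in self.panels for img in p.images]
```

Flattening a figure with `image_caption_pairs()` yields one image-caption pair per individual image, which is the unit a vision-language training loop would typically consume.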

📝 Abstract

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
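For readers unfamiliar with the detection metrics quoted above, intersection over union (IoU) measures how well a predicted bounding box overlaps a reference box, and mAP@0.50 counts a detection as correct when its IoU with a ground-truth box is at least 0.50. The sketch below shows the standard IoU formula for axis-aligned boxes; the function names and box format are assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Tuple

# Box format assumed for illustration: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.50) -> bool:
    """True if the prediction counts as correct at the given IoU threshold."""
    return iou(pred, truth) >= threshold
```

A median IoU of 0.997 therefore means that, for at least half of the evaluated figures, the extracted region and the reference region are almost identical.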

Problem

Research questions and friction points this paper is trying to address.

vision-language models

ophthalmology

image-text dataset

scientific literature

data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

ophthalmology

figure decomposition