Advancing Medical Representation Learning Through High-Quality Data

📅 2025-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Medical multimodal model performance depends on data quality, not merely scale, yet the effect of quality remains under-explored. To address this, we introduce Open-PMC, a high-quality medical image-text dataset drawn from PubMed Central that comprises 2.2 million image-text pairs enriched with image modality annotations, subfigures, and summarized in-text references. The in-text references supply medical context beyond what captions typically carry. On cross-modal retrieval and zero-shot classification, we benchmark models trained on Open-PMC against models trained on larger datasets, and the results indicate that dataset quality, not just size, drives significant performance gains. An accompanying analysis of the learned feature representations supports the importance of careful data curation.

📝 Abstract

Despite the growing scale of medical vision-language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
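
The retrieval and zero-shot classification tasks named in the abstract are commonly scored from similarities between image and text embeddings. The following is a minimal sketch of that general recipe, not the authors' evaluation code. The function names, the choice of cosine similarity, and the assumption that row i of the image embeddings and row i of the text embeddings form a matched pair are all illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_matrix(a, b):
    """Return sims[i][j] = cosine similarity between a[i] and b[j]."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def recall_at_k(image_emb, text_emb, k):
    """Image-to-text Recall@K.

    Assumption: image_emb[i] and text_emb[i] are a matched pair, so a query
    counts as a hit when text i appears among the k most similar texts.
    """
    sims = cosine_matrix(image_emb, text_emb)
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sims)


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image the class whose text-prompt embedding is most similar."""
    sims = cosine_matrix(image_emb, class_text_emb)
    return [max(range(len(row)), key=lambda j: row[j]) for row in sims]
```

In practice the embeddings would come from the trained image and text encoders, and the class embeddings for zero-shot classification would be produced by encoding one text prompt per class. Both of those steps are outside this sketch.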

Problem

Research questions and friction points this paper is trying to address.

Impact of dataset quality on medical AI performance

Introduction of Open-PMC for enriched medical context

Benchmarking quality-driven gains in retrieval and classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality medical dataset Open-PMC introduced

Image-text pairs enriched with modality annotations, subfigures, and summarized in-text references

Dataset quality drives significant performance gains
