Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

📅 2025-06-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Clinical text data are severely restricted from public release due to privacy concerns, hindering biomedical NLP. To address this, the authors introduce a large-scale, commercially licensable clinical case dataset derived from PubMed. The method employs a two-stage LLM-based annotation framework: (1) a large language model scores 400K paragraphs for type, domain, and educational quality; (2) a compact language model fine-tuned on these annotations propagates the labels across the full PMC-OA corpus. Multi-dimensional filtering (educational quality, domain relevance, license compliance) and domain-aware upsampling then yield refined, openly shareable pretraining subsets. Used for continued pretraining of OLMo2, the dataset improves MMLU ProfMed accuracy by ~5% and MedQA/MedMCQA by ~1%, and reaches the same performance with a third of the training tokens. The resulting resource establishes an open and efficient foundation for clinical NLP research.

πŸ“ Abstract
We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allow us to extract refined subsets, including 2M clinical case paragraphs, of which over 450K are high-quality paragraphs from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, since hospital records cannot be publicly shared; our dataset therefore provides a large-scale, openly available alternative collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements: clinical upsampling boosts performance by ~5% on MMLU ProfMed, and educational quality filtering improves MedQA and MedMCQA by ~1%. Combining these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
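The subset construction described above (educational-quality filtering, license filtering, and clinical-domain upsampling over paragraph-level metadata) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the field names (`edu_score`, `para_type`, `domain`, `license_commercial`), the quality threshold, and the upsampling factor are all hypothetical.

```python
def filter_and_upsample(paragraphs, min_edu_score=4, clinical_upsample=3):
    """Select a pretraining subset from annotated paragraphs.

    Keeps paragraphs whose educational-quality score (1-5) meets the
    threshold, drops clinical cases without a commercial-use license,
    and repeats clinical-domain paragraphs to upsample them.
    All field names and constants are illustrative assumptions.
    """
    subset = []
    for p in paragraphs:
        if p["edu_score"] < min_edu_score:
            continue  # educational-quality filtering
        if p["para_type"] == "clinical_case" and not p["license_commercial"]:
            continue  # keep only commercially licensed clinical cases
        repeats = clinical_upsample if p["domain"] == "clinical" else 1
        subset.extend([p] * repeats)  # domain upsampling
    return subset
```

In practice the upsampling factor and quality threshold would be tuned per training mix; the paper reports gains from clinical upsampling and quality filtering separately and in combination.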
Problem

Research questions and friction points this paper is trying to address.

Creating a biomedical dataset with LLM annotations for rare content
Providing open clinical case texts as an alternative to private records
Enhancing biomedical pretraining efficiency with quality-filtered subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage LLM annotation for biomedical text
Fine-tuned small model for label propagation
Quality filtering and domain upsampling variants
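The label-propagation idea above (fine-tune a small model on the LLM's annotations, then label the full corpus with it) can be caricatured with a toy stand-in classifier. The word-frequency "centroid model" below is purely illustrative and assumes nothing about the paper's actual fine-tuned model.

```python
from collections import Counter

def train_centroids(labeled):
    """'Train' a tiny stand-in classifier: one word-frequency centroid
    per label, built from LLM-annotated (text, label) pairs."""
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(text.lower().split())
    return centroids

def propagate(centroids, corpus):
    """Assign each unlabeled paragraph the label whose centroid it
    overlaps most, propagating the seed annotations across the corpus."""
    results = []
    for text in corpus:
        words = Counter(text.lower().split())
        best = max(
            centroids,
            key=lambda lab: sum(min(words[w], centroids[lab][w]) for w in words),
        )
        results.append((text, best))
    return results
```

A real pipeline would fine-tune a compact pretrained language model on the 400K LLM-annotated paragraphs and run inference over all of PMC-OA; the point here is only the two-step shape: learn from a small annotated seed, then label everything else.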