Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

📅 2026-05-07

🤖 AI Summary
This study addresses the limitations of manually curated biomedical databases: high maintenance costs, delayed updates, and too little experimental context to capture nuanced variation in the data. The authors propose a fully automated framework that uses large language models (LLMs) to build large-scale, structured biomedical datasets, enriched with fine-grained context, from full-text PubMed articles. The approach combines LLM-driven, ontology-aligned entity annotation, hybrid sparse-dense retrieval, and Starling, a multi-agent deep research system that designs retrieval filters, induces an extraction schema, and emits structured records from a natural-language task description. The resulting resource comprises approximately 6.3 million records across six tasks, several of which are, to the authors' knowledge, the largest public datasets of their kind. Frontier models reject 0.6-7.7% of the extracted records, compared with measured error rates of 16.5% on BBB_Martins and 7.3% on Bioavailability_Ma, two widely used curated datasets.
📝 Abstract
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
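To make contribution (2) concrete, here is a minimal sketch of an entity-filtered hybrid query over a toy corpus. It is not the authors' implementation: the abstract does not say how sparse and dense scores are combined, so reciprocal rank fusion is used here as an assumed stand-in, term counting stands in for a real sparse scorer such as BM25, and the function names, entity tag strings, and 2-d vectors are all hypothetical.

```python
import math
from collections import Counter

# Toy passages. Entity tags and 2-d "embeddings" are invented for illustration.
PASSAGES = [
    {"id": "p1",
     "text": "caffeine crosses the blood-brain barrier readily in rats",
     "entities": {"chemical:caffeine", "anatomy:blood-brain barrier"},
     "vec": [0.9, 0.1]},
    {"id": "p2",
     "text": "caffeine oral bioavailability is near complete in humans",
     "entities": {"chemical:caffeine"},
     "vec": [0.2, 0.8]},
    {"id": "p3",
     "text": "dopamine does not cross the blood-brain barrier",
     "entities": {"chemical:dopamine", "anatomy:blood-brain barrier"},
     "vec": [0.8, 0.3]},
]


def sparse_score(query, text):
    """Count query-term occurrences in the text (crude stand-in for BM25)."""
    terms = Counter(text.lower().split())
    return sum(terms[t] for t in query.lower().split())


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_vec, passages, required_entities, k=60):
    """Filter by entity tags, then fuse sparse and dense rankings with RRF."""
    # Keep only passages tagged with every required entity.
    candidates = [p for p in passages if required_entities <= p["entities"]]
    sparse_rank = sorted(candidates,
                         key=lambda p: sparse_score(query, p["text"]),
                         reverse=True)
    dense_rank = sorted(candidates,
                        key=lambda p: cosine(query_vec, p["vec"]),
                        reverse=True)
    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank).
    fused = Counter()
    for ranking in (sparse_rank, dense_rank):
        for rank, p in enumerate(ranking, start=1):
            fused[p["id"]] += 1.0 / (k + rank)
    return sorted(fused, key=lambda pid: fused[pid], reverse=True)


results = hybrid_search("blood-brain barrier permeability", [1.0, 0.0],
                        PASSAGES, {"chemical:caffeine"})
print(results)  # ['p1', 'p2']
```

The entity filter runs first, so the dopamine passage never reaches the ranking stage even though it matches the query terms. That ordering is what lets a query target one compound precisely while the two rankers handle relevance among the passages that remain.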
Problem

Research questions and friction points this paper is trying to address.

biomedical knowledge
manual curation
experimental context
data accuracy
literature lag
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based entity tagging
hybrid sparse-dense retrieval
multi-agent deep research system
nuance-rich structured data
AI-driven therapeutic design
Haydn Jones
University of Pennsylvania
Machine Learning
Yimeng Zeng
PhD Student, University of Pennsylvania
Machine Learning, Bayesian Optimization, Generative Models, Large Language Models
Alden Rose
Department of Computer and Information Science, University of Pennsylvania
Li S. Yifei
Department of Computer and Information Science, University of Pennsylvania
Yining Huang
Department of Computer and Information Science, University of Pennsylvania
Kaiwen Wu
University of Pennsylvania
machine learning, optimization
Jiaming Liang
Department of Computer and Information Science, University of Pennsylvania
Maggie Ziyu Huan
Department of Computer and Information Science, University of Pennsylvania
Yoseph Barash
Department of Genetics, University of Pennsylvania
Cesar de la Fuente-Nunez
Departments of Bioengineering and Chemical and Biomolecular Engineering, University of Pennsylvania
Osbert Bastani
University of Pennsylvania
Machine Learning, Artificial Intelligence, Programming Languages, Security
Zachary Ives
Department of Computer and Information Science, University of Pennsylvania
Mark Yatskar
University of Pennsylvania
Language and Vision, Natural Language Processing, Computer Vision, Fairness in AI, Machine Learning
Jacob R. Gardner
Department of Computer and Information Science, University of Pennsylvania