Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limitations of manually curated biomedical databases: they are costly to maintain, lag behind the primary literature, and drop the experimental context needed to judge whether a data point is correct. The authors propose a fully automated framework that uses large language models (LLMs) to build large-scale, structured biomedical datasets from PubMed papers, with each record enriched by fine-grained contextual fields and supporting passages. The approach combines LLM-driven, ontology-grounded entity tagging, hybrid sparse-dense retrieval over the tagged corpus, and Starling, a multi-agent deep research system that takes a natural-language task description, designs retrieval filters, induces an extraction schema, and emits structured records. The resulting resource comprises approximately 6.3 million records across six tasks, several of which are, according to the authors, the largest public datasets for their property. Frontier models reject 0.6-7.7% of the extracted records across tasks, well below the error rates the authors measure on widely used human-curated counterparts (16.5% on BBB_Martins, 7.3% on Bioavailability_Ma).

📝 Abstract

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
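To make the second contribution concrete, the following is a minimal sketch of an entity-filtered hybrid query. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy corpus, the tag names, the three-dimensional vectors, the term-overlap sparse scorer (a stand-in for something like BM25), and the use of reciprocal rank fusion to combine the sparse and dense rankings. The abstract does not say how the two signals are fused.

```python
import math
from collections import Counter

# Toy corpus: every id, tag, and vector below is invented for illustration.
CORPUS = [
    {"id": "p1", "text": "oral bioavailability of drug X was 40% in fasted rats",
     "entities": {"CHEMICAL", "SPECIES"}, "vec": [0.9, 0.1, 0.0]},
    {"id": "p2", "text": "drug X crosses the blood-brain barrier in mice",
     "entities": {"CHEMICAL", "SPECIES", "ANATOMY"}, "vec": [0.2, 0.9, 0.1]},
    {"id": "p3", "text": "bioavailability is discussed in general terms",
     "entities": set(), "vec": [0.8, 0.2, 0.1]},
]


def sparse_score(query, text):
    """Count overlapping terms; a simple stand-in for a sparse scorer such as BM25."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(text.lower().split())
    return sum(min(q_terms[t], d_terms[t]) for t in q_terms)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_vec, required_entities, corpus=CORPUS, k=60):
    """Return passage ids ranked by reciprocal rank fusion of sparse and dense scores."""
    # 1. Entity filter: keep only passages tagged with every required entity type.
    candidates = [d for d in corpus if required_entities <= d["entities"]]

    # 2. Rank the surviving passages independently by each signal.
    sparse_rank = sorted(candidates,
                         key=lambda d: sparse_score(query, d["text"]), reverse=True)
    dense_rank = sorted(candidates,
                        key=lambda d: cosine(query_vec, d["vec"]), reverse=True)

    # 3. Fuse the two rankings: each list contributes 1 / (k + rank) per passage.
    fused = {}
    for ranking in (sparse_rank, dense_rank):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc["id"]] = fused.get(doc["id"], 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


filtered = hybrid_search("oral bioavailability fasted", [1.0, 0.0, 0.0], {"CHEMICAL"})
unfiltered = hybrid_search("oral bioavailability fasted", [1.0, 0.0, 0.0], set())
```

In this sketch the entity filter runs before scoring, so the untagged passage `p3` never reaches the ranking stage when `CHEMICAL` is required, even though it matches the query term "bioavailability". Filtering first is one plausible way to read "entity-filtered semantic queries"; a real system could equally apply the filter inside the index.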

Problem

Research questions and friction points this paper is trying to address.

biomedical knowledge

manual curation

experimental context

data accuracy

literature lag

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based entity tagging

hybrid sparse-dense retrieval

multi-agent deep research system