Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark

📅 2024-06-13

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study tackles the automatic, timely labeling of clinical trial outcomes at scale, with the goal of improving outcome prediction models and accelerating drug development. We construct the CTO knowledge base, covering about 125K trials, and propose a multimodal label generation framework that combines news sentiment, sponsor stock prices, LLM interpretations of publications, and matching of trials across phases. We design a continuous-labeling approach that keeps pace with the pharmaceutical R&D lifecycle, and we quantify the distribution shift in trials from 2020 to 2024 using a manually annotated set. The generated labels reach an F1 score of 94 on Phase 3 trials and 91 across all phases when compared with previously published expert annotations. We publicly release a fully reproducible, regularly updated knowledge base and annotation dataset (https://chufangao.github.io/CTOD) as a benchmark for downstream evaluation.

📝 Abstract

Background: The global cost of drug discovery and development exceeds $200 billion annually, with clinical trial outcomes playing a critical role in the regulatory approval of new drugs and impacting patient outcomes. Despite their significance, large-scale, high-quality clinical trial outcome data are not readily available to the public, limiting advances in trial outcome predictive modeling. Methods: We introduce the Clinical Trial Outcome (CTO) knowledge base, a fully reproducible, large-scale (around 125K drug and biologics trials), open-source collection of clinical trial information, including large language model (LLM) interpretations of publications, matched trials over phases, sentiment analysis from news, stock prices of trial sponsors, and other trial-related metrics. From this knowledge base, we additionally performed manual annotation of a set of recent clinical trials from 2020 to 2024. Results: We evaluated the quality of our knowledge base by generating trial outcome labels that demonstrate strong agreement with previously published expert annotations, achieving an F1 score of 94 for Phase 3 trials and 91 across all phases. Additionally, we benchmarked a suite of standard machine learning models on our manually annotated set, highlighting the distribution shift of recent trials and the need for continuously updated labeling methods. Conclusions: By analyzing CTO's performance on recent trials, we showed a need for recent, high-quality trial outcome labels. We release our knowledge base and labels to the public at https://chufangao.github.io/CTOD, which will also be regularly updated to support ongoing research in clinical trial outcomes, offering insights that could optimize the drug development process.
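
The agreement figures in the abstract are F1 scores between generated labels and expert annotations. As a minimal sketch of what that comparison involves, the snippet below computes F1 for the positive class on toy data. The binary encoding (1 = success, 0 = failure), the function name `binary_f1`, and the example labels are illustrative assumptions, not the paper's actual evaluation code or data.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (assumed encoding: 1 = success, 0 = failure)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined here as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
expert = [1, 1, 0, 1, 0, 0, 1, 0]      # reference (expert) annotations
generated = [1, 1, 0, 0, 0, 1, 1, 0]   # automatically generated labels

print(f"F1 = {binary_f1(expert, generated):.2f}")  # F1 = 0.75
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75. The abstract reports its scores on a 0 to 100 scale, so a value like 0.94 here would correspond to the reported 94.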

Problem

Research questions and friction points this paper is trying to address.

Automated labeling of clinical trial outcomes

High-quality public clinical trial data

Continuous updates for trial outcome labels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale clinical trial database

LLM interpretations of publications

Manual annotation of recent trials
