AI Summary
Clinical trial AI development suffers from a lack of large-scale, structured, and ontology-aligned data. Method: We constructed a structured database covering 1.65 million trials, integrating 15 global sources, and achieved the first full-lifecycle ontology alignment of clinical trials, mapped to UMLS, DrugBank, and MedDRA. We further introduced CT-Bench, the first standardized benchmark for evidence-based medicine, comprising eight high-clinical-relevance tasks. Our methodology integrates heterogeneous multi-source data, fine-grained ontology mapping, and an LLM-based zero-shot evaluation framework. Contribution/Results: We publicly release both the database and CT-Bench. Extensive experiments on five state-of-the-art LLMs reveal significant performance gaps on critical tasks, underscoring the necessity of domain-specific modeling and filling a fundamental gap in AI evaluation for clinical trials.
Abstract
Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured, ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments with five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.
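The benchmark layout described above (eight tasks in two categories) can be sketched as a small lookup structure. This is only an illustrative sketch: the task and category names come from the abstract, but the Python identifiers (`Category`, `BENCHMARK_TASKS`, `tasks_in`, `all_tasks`) are hypothetical and are not taken from the released benchmark.

```python
from enum import Enum


class Category(Enum):
    """The two task categories named in the abstract."""
    SYSTEMATIC_REVIEW = "systematic review"
    TRIAL_DESIGN = "trial design"


# Task names follow the abstract; the snake_case identifiers are
# illustrative assumptions, not the benchmark's actual task keys.
BENCHMARK_TASKS = {
    Category.SYSTEMATIC_REVIEW: [
        "study_search",
        "study_screening",
        "evidence_summarization",
    ],
    Category.TRIAL_DESIGN: [
        "arm_design",
        "eligibility_criteria",
        "endpoint_selection",
        "sample_size_estimation",
        "trial_completion_assessment",
    ],
}


def tasks_in(category):
    """Return a copy of the task list for one category."""
    return list(BENCHMARK_TASKS[category])


def all_tasks():
    """Return all task names, flattened across categories."""
    return [task for tasks in BENCHMARK_TASKS.values() for task in tasks]
```

A structure like this makes the three-plus-five split explicit and gives an evaluation loop a single place to enumerate tasks.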