TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets

📅 2024-06-30
🏛️ arXiv.org
📈 Citations: 19
Influential: 0

🤖 AI Summary
Clinical trials frequently incur substantial resource waste due to high failure rates and prolonged development timelines; AI adoption remains hindered by the absence of standardized multimodal data and systematically defined prediction tasks. To address this, we introduce the first AI-ready multimodal dataset specifically designed for clinical trial planning, integrating drug molecules (SMILES), disease codes (ICD), clinical text, and structured trial features. It supports eight critical prediction tasks—including trial duration, patient dropout rate, incidence of serious adverse events, and regulatory approval outcomes—spanning the entire trial lifecycle. We formally propose, validate via expert clinician annotation and task alignment, and benchmark a comprehensive AI prediction taxonomy. A unified evaluation protocol and baseline models (XGBoost, Transformer) are established. The dataset, evaluation metrics, and code are fully open-sourced, substantially lowering barriers for AI researchers and already enabling multiple studies on trial simulation and design optimization.

📝 Abstract
Clinical trials are pivotal for developing new medical treatments, yet they typically pose some risks such as patient mortality, adverse events, and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise and a deep understanding of trial designs have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, trial failure reason, drug dose finding, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at https://github.com/ML2Health/ML2ClinicalTrials/tree/main/AI4Trial.
Problem

Research questions and friction points this paper is trying to address.

Predicting key events in clinical trials using AI
Overcoming complex data collection and medical expertise barriers
Providing multi-modal datasets for clinical trial design challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal AI-ready datasets for clinical trials
Covers 8 key prediction challenges in trials
Includes validation methods for dataset reliability
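
As a rough illustration of what "multi-modal" means here, the sketch below holds the modalities named in the abstract (drug molecule, disease code, text, categorical/numerical features) in one record per trial and flattens it into numeric features that a tabular baseline could consume. Everything in it is an assumption made for illustration: the `TrialRecord` class, its field names, the `to_flat_features` helper, the task identifiers, and the example values are hypothetical and do not reflect the released dataset's schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The eight prediction tasks listed in the abstract.
# The snake_case identifiers are illustrative, not names used by the dataset.
TASKS = [
    "trial_duration",
    "patient_dropout_rate",
    "serious_adverse_event",
    "mortality_rate",
    "trial_approval_outcome",
    "trial_failure_reason",
    "drug_dose_finding",
    "eligibility_criteria_design",
]


@dataclass
class TrialRecord:
    """Hypothetical container for one trial; not the released schema."""
    trial_id: str
    smiles: List[str]        # drug molecules as SMILES strings
    icd_codes: List[str]     # disease codes
    text: str                # free-text description or criteria
    numeric: Dict[str, float] = field(default_factory=dict)
    categorical: Dict[str, str] = field(default_factory=dict)


def to_flat_features(record: TrialRecord) -> Dict[str, float]:
    """Flatten a record into simple numeric features (a deliberately crude baseline)."""
    feats: Dict[str, float] = dict(record.numeric)
    feats["n_drugs"] = float(len(record.smiles))
    feats["n_diseases"] = float(len(record.icd_codes))
    feats["text_n_tokens"] = float(len(record.text.split()))
    # One-hot encode categorical fields as "<field>=<value>" indicators.
    for name, value in record.categorical.items():
        feats[f"{name}={value}"] = 1.0
    # Use the first three characters of each disease code as a coarse indicator.
    for code in record.icd_codes:
        feats[f"icd_prefix={code[:3]}"] = 1.0
    return feats


# Made-up example record; the trial ID and all values are placeholders.
example = TrialRecord(
    trial_id="EXAMPLE-0001",
    smiles=["CC(=O)OC1=CC=CC=C1C(=O)O"],  # aspirin
    icd_codes=["I21.9"],
    text="Adults aged 18 to 75 with acute myocardial infarction",
    numeric={"planned_enrollment": 200.0},
    categorical={"phase": "Phase 2"},
)
features = to_flat_features(example)
```

A real pipeline would replace the count features with learned encoders for each modality (for example a molecular encoder for SMILES and a text encoder for criteria); the flat dictionary is only meant to show how the modalities sit side by side in one record.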
Authors
Jintai Chen
Assistant Professor@HKUST(GZ)
AI for Healthcare, Multimodal Learning, Deep Tabular Learning
Yaojun Hu
College of Computer Science and Technology Zhejiang University, Hangzhou, China
Yue Wang
College of Computer Science and Technology Zhejiang University, Hangzhou, China
Yingzhou Lu
School of Medicine, Stanford University, Stanford, CA, USA
Xu Cao
Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
Miao Lin
Medical Big Data Center, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
Hongxia Xu
Zhejiang University
AI4Science, Nanomedicine, Medical imaging
Jian Wu
The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
Cao Xiao
GE HealthCare, Chicago, USA
Jimeng Sun
Professor at University of Illinois Urbana-Champaign
AI for healthcare, Machine learning for healthcare, deep learning for healthcare
Lucas Glass
IQVIA, Boston, USA
Kexin Huang
Computer Science Department, Stanford University, Stanford, CA, USA
M. Zitnik
Informatics, Harvard Medical School, Harvard University, USA
Tianfan Fu
Nanjing University
AI for Drug, AI for Science, Large Language Model