TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

📅 2026-03-22
🤖 AI Summary
This study addresses the inefficiency of manually extracting time toxicity, defined as the cumulative number of healthcare contact days resulting from clinical trial participation, from clinical trial protocols. The authors propose TimeTox, an automated pipeline built on Google’s Gemini models with three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints per treatment arm, and multi-run consensus via position-based arm matching. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). On 20 synthetic schedules, the two-stage architecture reached 100% clinically acceptable accuracy (±3 days) versus 41.5% for the single-pass architecture. On 644 real-world oncology protocols, however, the single-pass architecture was more reproducible: across 3 runs, 95.3% of cases fell within the clinically acceptable range (interquartile range [IQR] ≤ 3 days) and 82.0% showed perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms.

📝 Abstract
Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
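The consensus and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the data layout (one list of cumulative contact-day counts per arm per run), the use of the median as the consensus value, and the inclusive quartile method are all assumptions made for illustration. The abstract states only that arms are matched by position, that stability is judged by IQR across runs, and that accuracy is judged within ±3 days.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range with inclusive (linear-interpolation) quartiles.
    The quartile method is an assumption; the source does not specify one."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def consensus(runs, tolerance_days=3):
    """Combine repeated extractions of the same protocol.

    runs: list of runs; each run is a list of arms; each arm is a list of
    cumulative contact-day counts, one per timepoint (hypothetical layout).
    Arms are matched across runs by position (index)."""
    n_arms = min(len(run) for run in runs)  # keep arms present in every run
    results = []
    for arm_idx in range(n_arms):
        per_timepoint = []
        for t in range(len(runs[0][arm_idx])):
            values = [run[arm_idx][t] for run in runs]
            spread = iqr(values)
            per_timepoint.append({
                "consensus_days": median(values),  # assumed consensus rule
                "iqr": spread,
                "stable": spread == 0,               # perfect stability
                "acceptable": spread <= tolerance_days,
            })
        results.append(per_timepoint)
    return results


def accuracy_metrics(predicted, truth, tolerance_days=3):
    """Mean absolute error and share of values within the tolerance,
    for comparing extracted counts against known (e.g. synthetic) values."""
    errors = [abs(p - t) for p, t in zip(predicted, truth)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance_days for e in errors) / len(errors)
    return mae, within
```

For example, `consensus([[[2, 4]], [[2, 5]], [[2, 12]]])` treats three runs of a single-arm protocol with two timepoints: the first timepoint is identical in every run and is flagged stable, while the second varies across runs and is flagged as outside the 3-day tolerance.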
Problem

Research questions and friction points this paper is trying to address.

time toxicity
clinical trial protocols
automated extraction
Schedule of Assessments
healthcare contact days
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based pipeline
time toxicity
clinical trial protocols
automated extraction
reproducibility
Saketh Vinjamuri
Fairview Hospital, Cleveland Clinic Foundation, Cleveland, OH, USA
Marielle Fis Loperena
The George Washington University School of Medicine and Health Sciences, Washington, DC, USA
Marie C. Spezia
University of Missouri-Columbia School of Medicine, Columbia, MO, USA
Ramez Kouzy
MD Anderson Cancer Center
Oncology, Radiation Oncology, Clinical Trials, Artificial Intelligence, Digital Health