🤖 AI Summary
This study addresses the labor-intensive manual extraction of time toxicity, defined as the cumulative healthcare contact days incurred through clinical trial participation, from clinical trial protocols. The authors present TimeTox, an automated pipeline built on Google's Gemini models that extracts a summary from the full-length protocol PDF, quantifies time toxicity at six cumulative timepoints for each treatment arm, and merges multiple runs into a consensus via position-based arm matching. Two architectures are compared: single-pass (vanilla) and two-stage (structure-then-count). On 20 synthetic schedules, the two-stage architecture reached 100% clinically acceptable accuracy (±3 days), versus 41.5% for the single-pass architecture. On 644 real-world oncology protocols, however, the single-pass architecture showed superior reproducibility: 95.3% clinically acceptable (interquartile range [IQR] ≤ 3 days) across 3 runs, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. The authors conclude that extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production deployment.
📝 Abstract
Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
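The multi-run consensus step and the IQR-based stability criteria can be illustrated with a minimal sketch. It is not the TimeTox implementation. The function name `consensus_by_position`, the timepoint labels, the data layout, the use of the median as the consensus value, and the inclusive quartile method are all assumptions made for illustration. Only the position-based arm matching, the 3-run setup, and the IQR thresholds (≤ 3 days acceptable, 0 perfectly stable) come from the abstract.

```python
from statistics import median, quantiles


def consensus_by_position(runs, timepoints, tolerance_days=3):
    """Combine repeated extractions of the same protocol.

    runs: list of runs; each run is a list of arms in the order they were
          extracted; each arm maps a timepoint label to contact days.
    Arms are matched across runs by list position (assumed layout).
    """
    # Only positions present in every run can be matched.
    n_arms = min(len(run) for run in runs)
    results = []
    for arm_idx in range(n_arms):
        arm_result = {}
        for tp in timepoints:
            values = [run[arm_idx][tp] for run in runs]
            # Quartile convention is an assumption; the abstract does not
            # say how the IQR over 3 runs is computed.
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
            iqr = q3 - q1
            arm_result[tp] = {
                "consensus_days": median(values),  # assumed consensus rule
                "iqr": iqr,
                "stable": iqr == 0,                # perfect stability
                "acceptable": iqr <= tolerance_days,
            }
        results.append(arm_result)
    return results


# Hypothetical data: 3 runs, 1 arm, 2 illustrative timepoints.
example_runs = [
    [{"day_30": 4, "day_90": 10}],
    [{"day_30": 4, "day_90": 12}],
    [{"day_30": 4, "day_90": 20}],
]
example_result = consensus_by_position(example_runs, ["day_30", "day_90"])
```

In this made-up example the three runs agree at `day_30`, so that timepoint counts as perfectly stable, while the spread at `day_90` exceeds the 3-day tolerance and would be flagged for review.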