🤖 AI Summary
Acute polysubstance poisoning presents significant diagnostic challenges due to nonspecific symptoms and incomplete clinical information, necessitating the integration of unstructured on-scene narratives with structured vital sign data to improve diagnostic accuracy. This work proposes DeToxR, the first system to incorporate reinforcement learning into emergency toxicology decision support. It fine-tunes a large language model using Group Relative Policy Optimization (GRPO) and introduces a novel reward mechanism centered on multi-label consistency to directly optimize clinically relevant performance metrics. The approach effectively mitigates the model's tendency to omit co-ingested substances or generate spurious predictions. Evaluated on a 14-class multi-label toxic substance classification task, DeToxR significantly outperforms both supervised baselines and the original model, achieving a clinically validated Micro-F1 score of 0.644, surpassing that of an expert toxicologist (0.473), and demonstrating strong potential for real-world clinical deployment.
📝 Abstract
Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g., paramedic scene descriptions and unreliable patient self-reports or known histories) with structured medical data such as vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM fine-tuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and for hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model demonstrates a clinical advantage, outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
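The abstract describes the reward as a multi-label agreement metric that penalizes both missed co-ingested substances and hallucinated absent poisons. A minimal sketch of such a reward is an F1-style score over predicted versus gold substance-class sets; the class names and exact reward shape below are illustrative assumptions, not the authors' implementation:

```python
from typing import Set

def multi_label_reward(predicted: Set[str], gold: Set[str]) -> float:
    """F1-style agreement reward over substance-class label sets.

    False positives (hallucinated poisons) lower precision;
    false negatives (missed co-ingestions) lower recall, so the
    reward penalizes both failure modes symmetrically.
    """
    tp = len(predicted & gold)   # correctly identified substances
    fp = len(predicted - gold)   # hallucinated absent poisons
    fn = len(gold - predicted)   # missed co-ingested substances
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: one correct label, one hallucination, one missed co-ingestion
# (precision = recall = 0.5, so the reward is 0.5)
r = multi_label_reward({"opioid", "benzodiazepine"}, {"opioid", "ethanol"})
```

In a GRPO setup, a scalar reward of this form would be computed per sampled completion and used to form group-relative advantages, so completions that recover more of the true label set without adding spurious poisons are reinforced.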