ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate large language models’ ability to reason about chemical toxicity based on biological mechanisms, often leading models to generate superficially plausible yet mechanistically incorrect explanations. This work introduces the first toxicity reasoning benchmark that integrates the Adverse Outcome Pathway (AOP) framework with experimental evidence from drug–target interaction studies, requiring models to infer organ-level adverse outcomes through a stepwise chain of molecular initiating events. By incorporating a reasoning-aware training approach, the study enables joint validation of both the model’s reasoning process and its final predictions, substantially improving the reliability of mechanistic reasoning and the accuracy of toxicity prediction. The findings demonstrate that strong predictive performance does not necessarily reflect correct mechanistic understanding.
📝 Abstract
Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug–target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.
Problem

Research questions and friction points this paper is trying to address.

toxicity reasoning
mechanistic reasoning
Adverse Outcome Pathway
large language models
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic reasoning
Adverse Outcome Pathway
toxicity prediction
large language models
benchmark
Jueon Park
Korea University
AI, Drug Discovery
Wonjune Jang
Myongji University
Chanhwi Kim
University of Texas Health Science Center at Houston
Yein Park
Korea University
NLP, RAG, Knowledge Conflict, Knowledge Editing
Jaewoo Kang
Korea University, AIGEN Sciences