Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge

📅 2025-11-18
🤖 AI Summary
Over the past decade, the original Tox21 challenge dataset has undergone repeated modifications, label imputation, and re-annotation, undermining reproducibility and hindering fair comparison and progress assessment of toxicity prediction models. Method: We reconstruct a fully reproducible, standardized benchmark grounded exclusively in the original Tox21 data. This includes (i) the first version-controlled, publicly accessible leaderboard on Hugging Face; (ii) a unified API and standardized preprocessing pipeline; and (iii) systematic evaluation, within the MoleculeNet framework, of deep neural networks (e.g., DeepTox), ensemble methods, and descriptor-based self-normalizing models. Contribution/Results: Our evaluation shows that early state-of-the-art models, particularly DeepTox, remain competitive with or superior to many recent approaches, suggesting limited substantive advances in toxicity prediction over the last ten years. The primary contribution is the first open, reproducible, and standardized Tox21 benchmark, providing a rigorous foundation for evaluating AI-driven toxicological modeling.

📝 Abstract
Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection, akin to vision's "ImageNet moment", arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To address this, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner, the ensemble-based DeepTox method, and the descriptor-based self-normalizing neural networks introduced in 2017 continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.
Problem

Research questions and friction points this paper is trying to address.

Measuring AI progress in toxicity prediction over past decade
Addressing dataset alterations that hinder comparability across studies
Determining if substantial improvements occurred since Tox21 benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reproducible leaderboard using original Tox21 dataset
Standardized API calls via Hugging Face Spaces
Public baselines and models for toxicity prediction
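Leaderboards of this kind typically rank models by per-assay ROC-AUC on a sparsely labeled test set, since Tox21 compounds are not measured in every assay. Below is a minimal sketch of such a masked, per-task evaluation using scikit-learn; the NaN encoding of missing labels and the function name are illustrative assumptions, not taken from the paper's actual pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_task_auc(y_true, y_score):
    """Compute ROC-AUC per assay, ignoring missing (NaN) labels.

    y_true:  (n_compounds, n_tasks) array with entries in {0, 1, NaN}
    y_score: (n_compounds, n_tasks) array of predicted probabilities
    """
    aucs = []
    for t in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, t])      # compounds measured in this assay
        labels = y_true[mask, t]
        if len(np.unique(labels)) < 2:      # AUC undefined without both classes
            aucs.append(float("nan"))
            continue
        aucs.append(roc_auc_score(labels, y_score[mask, t]))
    return np.array(aucs)

# Toy example: 5 compounds, 2 assays, one missing label
y_true = np.array([[1, 0], [0, 1], [1, np.nan], [0, 0], [1, 1]], dtype=float)
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.5], [0.3, 0.1], [0.7, 0.9]])
aucs = per_task_auc(y_true, y_score)
print(aucs, np.nanmean(aucs))  # per-assay AUCs and a leaderboard-style mean
```

Averaging the per-assay AUCs (ignoring undefined tasks) gives a single summary score comparable across models, which is how Tox21-style challenge rankings are commonly aggregated.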
Antonia Ebner
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Christoph Bartmann
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Sonja Topf
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Sohvi Luukkonen
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Johannes Schimunek
Johannes Kepler University Linz - ELLIS Unit at the LIT AI Lab
G. Klambauer
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria; Clinical Research Institute for Medical AI, Johannes Kepler University, Linz, Austria