🤖 AI Summary
Over the past decade, the original Tox21 challenge dataset has undergone repeated modifications, label imputation, and re-annotation, undermining reproducibility and hindering fair comparison and progress assessment of toxicity prediction models.
Method: We reconstruct a fully reproducible, standardized benchmark grounded exclusively in the original Tox21 data. This includes (i) the first version-controlled, publicly accessible leaderboard on Hugging Face; (ii) a unified API and standardized preprocessing pipeline; and (iii) systematic evaluation—within the MoleculeNet framework—of deep neural networks (e.g., DeepTox), ensemble methods, and descriptor-based self-normalizing models.
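A central point above is that prior benchmark integrations imputed or manufactured missing Tox21 labels, whereas a standardized pipeline should instead mask them out of evaluation. The sketch below illustrates that masking idea for multi-task labels; the function name and the NaN-for-unmeasured convention are illustrative assumptions, not the paper's documented implementation.

```python
import numpy as np

# Tox21 defines 12 binary assay tasks; compounds without a measurement for
# a task are left as NaN rather than imputed, and a mask excludes them.
def masked_task_accuracy(labels: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Per-task accuracy computed only over measured (non-NaN) labels."""
    mask = ~np.isnan(labels)            # True where the assay was actually run
    correct = (labels == preds) & mask  # NaN == x is False, so masked anyway
    measured = mask.sum(axis=0)         # number of measured compounds per task
    return np.where(
        measured > 0,
        correct.sum(axis=0) / np.maximum(measured, 1),
        np.nan,                         # task with no measurements: undefined
    )
```

The same masking pattern applies to AUC-style metrics used on Tox21; the key design choice is that missing labels never contribute to the score, so results stay comparable to the original challenge data.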
Contribution/Results: Our evaluation reveals that early state-of-the-art models—particularly DeepTox—remain highly competitive, outperforming many recent approaches and indicating limited substantive advances in toxicity prediction over the last ten years. The primary contribution is the establishment of the first open, reproducible, and standardized Tox21 benchmark, providing a rigorous foundation for evaluating AI-driven toxicological modeling.
📝 Abstract
Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection - akin to vision's "ImageNet moment" - arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To address this, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner - the ensemble-based DeepTox method - and the descriptor-based self-normalizing neural networks introduced in 2017 continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.
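The abstract closes by noting that all evaluated models are exposed for inference via standardized API calls to Hugging Face Spaces. As a hedged sketch of what such a call might look like, the snippet below POSTs a SMILES string to a Gradio-style prediction endpoint; the Space URL, the `/api/predict` path, and the single-SMILES input schema are assumptions for illustration, not the paper's documented interface.

```python
import json
import urllib.request

def build_payload(smiles: str) -> bytes:
    # Gradio-style Spaces accept a JSON body with a "data" list of inputs;
    # here we assume the model takes a single SMILES string.
    return json.dumps({"data": [smiles]}).encode("utf-8")

def predict_toxicity(smiles: str, space_url: str) -> dict:
    # Send the request to the Space's prediction endpoint (the endpoint
    # path follows the common Gradio convention and is an assumption here).
    req = urllib.request.Request(
        space_url.rstrip("/") + "/api/predict",
        data=build_payload(smiles),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A caller would pass the URL of a specific leaderboard Space, e.g. `predict_toxicity("CCO", "https://<space-url>")`, and receive the model's per-assay predictions as JSON; the uniform request shape is what makes the baselines interchangeable behind one API.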