PepBenchmark: A Standardized Benchmark for Peptide Machine Learning

Date: 2026-04-12

AI Summary
This work addresses the challenge of fair comparison and progress in machine learning for peptide therapeutics, which has been hindered by the absence of standardized benchmarks. To this end, we introduce PepBenchmark, the first unified benchmark encompassing 29 canonical and 6 non-canonical peptide datasets, integrated with systematic data curation, feature transformation pipelines, and a consistent evaluation protocol. We establish comprehensive baselines using four prominent methodological families: molecular fingerprints, graph neural networks (GNNs), protein language models (PLMs), and SMILES-based representations. This effort delivers the most extensive AI-ready resource and multi-model leaderboard for peptide research to date, substantially enhancing algorithmic comparability, reproducibility, and practical utility in peptide drug discovery.

Abstract
Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide machine learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection of 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups that systematically covers key aspects of peptide drug development and is, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, these components provide the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at https://github.com/ZGCI-AI4S-Pep/PepBenchmark/.
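To make the pipeline stages named in the abstract (cleaning, splitting, feature transformation) concrete, the following is a minimal illustrative sketch in plain Python. It is not the PepBenchPipeline API: the function names `clean`, `split`, and `featurize`, the amino-acid-composition features, and the seeded random split are all assumptions chosen for brevity. The sketch handles canonical sequences only, so the non-canonical datasets would need a different representation such as SMILES.

```python
import random
from collections import Counter

# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def clean(records):
    """Hypothetical cleaning step: normalize case, drop sequences with
    non-canonical letters, and drop exact duplicates (first label wins)."""
    seen, out = set(), []
    for seq, label in records:
        seq = seq.strip().upper()
        if seq and set(seq) <= CANONICAL and seq not in seen:
            seen.add(seq)
            out.append((seq, label))
    return out


def split(records, test_fraction=0.2, seed=0):
    """Hypothetical splitting step: a seeded random split. A benchmark
    would more likely use a stricter, similarity-aware split."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def featurize(seq):
    """Hypothetical feature transformation: amino-acid composition,
    i.e. 20 residue frequencies in alphabetical order."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in sorted(CANONICAL)]


# Toy records of (sequence, label); the labels are made up for illustration.
records = [("acdk", 1), ("ACDK", 1), ("GGXG", 0), ("KKLL", 0), ("WWFY", 1)]
train, test = split(clean(records))
features = [featurize(seq) for seq, _ in train]
```

Fixing the seed and applying the same cleaning rules to every dataset is what makes results comparable across models, which is the motivation the abstract gives for a standardized pipeline.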
Problem

Research questions and friction points this paper is trying to address.

peptide therapeutics
machine learning
standardized benchmark
drug discovery
AI-ready datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

peptide machine learning
standardized benchmark
data curation
preprocessing pipeline
unified evaluation