🤖 AI Summary
Protein design faces the motif-scaffolding problem: identifying diverse protein structures (scaffolds) that include a given functional motif and preserve its geometry. Evaluation strategies vary widely across publications, which hinders reproducibility and cross-method comparison.
Method: We introduce MotifBench, a standardized benchmark for this task. It comprises 30 difficult test cases, including problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution. The benchmark defines a precisely specified evaluation pipeline and metrics, building on protein structure prediction and fixed-backbone sequence design.
Contribution/Results: MotifBench enables consistent, quantitative comparison across methods. An implementation of the benchmark and a leaderboard are available at github.com/blt2114/MotifBench. The test cases are more difficult than those in earlier benchmarks and expose limitations of state-of-the-art approaches, providing a foundation for advancing motif-scaffolding methods.
📝 Abstract
The motif-scaffolding problem is a central task in computational protein design: given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been enabled by computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods. However, substantial variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response, we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult than those in earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution.
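To make "maintain its geometry" concrete, here is a minimal sketch of one way to test whether a scaffold preserves a motif. It is not the MotifBench metric: the abstract does not specify the metric, and this sketch swaps in an alignment-free distance RMSD (the root-mean-square difference between corresponding pairwise atom distances) so that it needs only the Python standard library. The function names `distance_rmsd` and `preserves_motif`, the 1.0 Å threshold, and the toy coordinates are all illustrative assumptions.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_rmsd(motif: Sequence[Coord], scaffold_motif: Sequence[Coord]) -> float:
    """RMS difference between corresponding pairwise distances (in the input units).

    Both inputs must list the same atoms in the same order. The measure is
    invariant to translation and rotation, so no superposition is needed.
    It is also invariant to reflection, so it cannot detect a mirror image.
    """
    if len(motif) != len(scaffold_motif):
        raise ValueError("coordinate lists must have the same length")
    if len(motif) < 2:
        raise ValueError("need at least two atoms")
    sq_diffs = [
        (math.dist(motif[i], motif[j])
         - math.dist(scaffold_motif[i], scaffold_motif[j])) ** 2
        for i, j in combinations(range(len(motif)), 2)
    ]
    return math.sqrt(sum(sq_diffs) / len(sq_diffs))


def preserves_motif(motif: Sequence[Coord],
                    scaffold_motif: Sequence[Coord],
                    threshold: float = 1.0) -> bool:
    """Hypothetical pass/fail check; the threshold is an assumed value."""
    return distance_rmsd(motif, scaffold_motif) <= threshold


# Toy motif: three atoms spaced 3.8 apart at a right angle (made-up coordinates).
motif = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

# The same motif translated rigidly, as if embedded elsewhere in a scaffold.
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in motif]

# A distorted copy in which the third atom has moved away.
distorted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 7.6, 0.0)]

print(preserves_motif(motif, shifted))    # rigid motion keeps the geometry
print(preserves_motif(motif, distorted))  # distortion exceeds the threshold
```

A superposition-based RMSD (for example via the Kabsch algorithm) is the more common choice in structure comparison and, unlike this sketch, distinguishes mirror images; the distance-based form is used here only to keep the example dependency-free.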