🤖 AI Summary
This work addresses the divide between conditional generative modeling and sequence optimization via structure predictors (“hallucination”) in structure-based de novo protein binder design by introducing Proteina-Complexa, a fully atomistic binder generation method that unifies both paradigms. Proteina-Complexa extends recent flow-based latent protein generation architectures and combines generative pretraining with inference-time optimization. For pretraining, the authors construct Teddymer, a large-scale dataset of synthetic binder-target pairs derived from domain-domain interactions in computationally predicted monomeric structures, and combine it with high-quality experimental multimers. On computational benchmarks, Proteina-Complexa achieves markedly higher in-silico success rates than existing generative approaches, and its test-time optimization strategies outperform previous hallucination methods under normalized compute budgets. The framework also extends to small-molecule targets and enzyme design tasks, where it again surpasses prior methods.
📝 Abstract
Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
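As a rough illustration of what inference-time optimization with a generative prior can look like in its simplest form, the sketch below implements plain best-of-N search: draw candidates from a sampler, score each one, and keep the best. This is not the method described in the abstract, which does not specify its optimization strategies. The names `sample_candidate`, `score`, and `best_of_n` are hypothetical, the sampler is a random toy stand-in for a generative model, and the reward is a toy stand-in for a structure-predictor-based score.

```python
import random


def sample_candidate(rng, length=8):
    """Toy stand-in for a generative prior (assumption): a random string over a 4-letter alphabet."""
    return "".join(rng.choice("ACDE") for _ in range(length))


def score(candidate):
    """Toy stand-in for a structure-predictor reward (assumption): the number of 'A' characters."""
    return candidate.count("A")


def best_of_n(n_samples, seed=0):
    """Draw n_samples candidates from the prior and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n_samples)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # A larger sampling budget can only match or improve the best score for a fixed seed.
    for budget in (1, 8, 64):
        best = best_of_n(budget, seed=1)
        print(budget, best, score(best))
```

In this toy setting, `n_samples` plays the role of the compute budget: comparing search strategies at the same number of sampler and scorer calls is one simple way to normalize compute between methods.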