🤖 AI Summary
This work addresses protein sequence engineering, the problem of finding high-fitness protein sequences starting from a wild-type sequence, by proposing a zero-shot optimization framework that requires no fine-tuning on biological data. Methodologically, it is the first to uncover the latent protein sequence optimization capability of large language models (LLMs), combining task-oriented prompt engineering, multi-objective Pareto frontier search, and iterative sampling with reweighting under experimental budget constraints to perform efficient directed evolution from wild-type sequences. Its key contribution is breaking the conventional paradigm of fine-tuning on biological data, instead transferring knowledge across modalities to protein sequence design. Experiments on multiple synthetic and experimental fitness landscapes show that the method discovers high-fitness sequences at a significantly higher rate, with fewer mutations and fewer experimental rounds, than state-of-the-art baselines.
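To make the loop concrete, below is a minimal sketch of a budget-constrained directed evolution round of the kind the summary describes. Everything here is an illustrative assumption rather than the authors' code: `mock_llm_propose` stands in for a prompted LLM variant proposer, and `mock_fitness` stands in for an experimental fitness measurement.

```python
import random

def mock_llm_propose(parent: str, n: int) -> list[str]:
    """Stand-in for an LLM prompted to suggest variants of a sequence.
    In the paper's setting this would be an LLM call driven by a
    task-oriented prompt; here we apply random single-point mutations."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    variants = []
    for _ in range(n):
        pos = random.randrange(len(parent))
        variants.append(parent[:pos] + random.choice(alphabet) + parent[pos + 1:])
    return variants

def mock_fitness(seq: str) -> float:
    """Stand-in for an experimental fitness oracle (hypothetical)."""
    return sum(seq.count(a) for a in "AG") / len(seq)

def directed_evolution(wild_type: str, budget: int, batch: int = 8) -> str:
    """Iteratively propose variants and keep the fittest seen so far,
    stopping when the experimental budget (measured sequences) is spent."""
    best_seq, best_fit = wild_type, mock_fitness(wild_type)
    spent = 1  # the wild type itself counts as one measurement
    while spent + batch <= budget:
        candidates = mock_llm_propose(best_seq, batch)
        spent += batch
        for seq in candidates:
            fit = mock_fitness(seq)
            if fit > best_fit:
                best_seq, best_fit = seq, fit
    return best_seq

print(directed_evolution("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", budget=100))
```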
📝 Abstract
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. Directed evolution has been the dominant paradigm in this field: an iterative process that generates variants and selects among them via experimental feedback. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers. Embedded in a directed evolution loop, LLMs can perform protein engineering through Pareto-optimal and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.
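As a complement to the loop above, here is a minimal sketch of the Pareto-constrained selection step the abstract mentions, assuming each candidate sequence is scored on several objectives to be maximized (for instance, fitness and negated mutation count). The dominance rule is the standard multi-objective definition; the sequence names and scores are hypothetical, not data from the paper.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if score vector `a` is at least as good as `b` on every
    objective and strictly better on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored: dict[str, tuple[float, ...]]) -> list[str]:
    """Keep only sequences whose score vectors are not dominated."""
    return [
        s for s, sc in scored.items()
        if not any(dominates(other, sc) for o, other in scored.items() if o != s)
    ]

# Hypothetical scores: (fitness, -mutation_count), both to maximize.
scores = {"SEQ1": (0.9, -3.0), "SEQ2": (0.7, -1.0), "SEQ3": (0.6, -4.0)}
print(pareto_front(scores))  # SEQ1 and SEQ2 survive; SEQ3 is dominated
```

Keeping the non-dominated set rather than a single best sequence lets the loop trade off fitness against mutational distance from the wild type when choosing what to measure next.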