AI Summary
Supervised fine-tuning (SFT) of protein language models (PLMs) is hindered by the scarcity of high-quality experimentally annotated data. Method: We propose a self-distillation-based SFT framework that leverages the PLM itself to generate candidate sequences, followed by lightweight screening and domain-specific filtering (e.g., for structural stability and expressibility) to construct a high-confidence training subset, thereby closing a "generate-filter-fine-tune" loop. Contribution/Results: This approach eliminates reliance on precompiled experimental datasets and enables, for the first time, PLM-driven self-consistent data construction and model evolution. Validated on the tryptophan synthase family, the generated sequences exhibit significantly improved thermal stability and catalytic activity. Moreover, the method achieves superior target-constraint satisfaction rates and emergent properties, including foldability and expressibility, outperforming all baselines and substantially expanding the design space for non-natural functional proteins.
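To make the loop concrete, here is a minimal Python sketch of one round of generate-filter-fine-tune self-distillation. All names (`plm.sample`, `plm.fine_tune`, the `filters` list) are hypothetical placeholders, not the authors' actual API; the filters stand in for the domain-specific screens (e.g., predicted stability and expressibility) described above.

```python
def self_distillation_sft(plm, n_candidates, n_rounds, filters):
    """Sketch of the generate-filter-fine-tune loop (assumed interface).

    plm          -- a protein language model exposing hypothetical
                    sample() and fine_tune() methods
    n_candidates -- sequences to sample per round
    n_rounds     -- number of self-distillation rounds
    filters      -- callables mapping a sequence to True/False, standing
                    in for stability / expressibility screens
    """
    for _ in range(n_rounds):
        # 1. Generate: sample candidate sequences from the current model.
        candidates = [plm.sample() for _ in range(n_candidates)]

        # 2. Filter: keep only candidates that pass every
        #    domain-specific screen, forming the high-confidence subset.
        curated = [seq for seq in candidates
                   if all(check(seq) for check in filters)]

        # 3. Fine-tune: run standard SFT on the curated subset, so the
        #    next round samples from the updated model.
        plm = plm.fine_tune(curated)
    return plm
```

The key design point is that no precompiled experimental dataset enters the loop: the training data at each round come entirely from the model's own filtered generations.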
Abstract
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration of protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of PLM and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.