🤖 AI Summary
Predicting the fitness effects of protein mutations is hindered by a fundamental tension between sparse experimental data and the astronomically large sequence space. To address this, we propose EvoIF, the first model to reinterpret masked language modeling (MLM) in protein language models as an inverse reinforcement learning framework, in which natural evolutionary trajectories serve as implicit reward signals. EvoIF jointly incorporates intra-family evolutionary profiles and inter-family structure–evolution constraints, and introduces inverse-folding logit distillation alongside lightweight transition blocks to enable sequence–structure co-representation. Evaluated on the ProteinGym benchmark, comprising 217 experimental assays and over 2.5 million variants, EvoIF achieves state-of-the-art or near-state-of-the-art performance using only 0.15% of the training data and fewer parameters than competing methods, and demonstrates improved few-shot generalization and cross-assay robustness.
📝 Abstract
Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence–structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available at https://github.com/aim-uofa/EvoIF.
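To make the log-odds scoring idea concrete, here is a minimal sketch of how a mutation's fitness can be estimated from a pLM's masked-position probabilities: the score is the log-probability of the mutant residue minus that of the wild-type residue, summed over mutated positions. This is a generic illustration of masked-marginal log-odds scoring, not EvoIF's actual implementation; the function names, the toy random logits, and the `(L, 20)` array layout are all assumptions for the example.

```python
import numpy as np

# Canonical 20 amino acids; the index of each letter is its column in log_probs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def log_odds_score(log_probs, wt_seq, mutations):
    """Masked-marginal log-odds fitness estimate for a variant.

    log_probs : (L, 20) array of per-position log-probabilities, as a pLM
                would emit after masking each position in turn.
    wt_seq    : wild-type sequence of length L.
    mutations : list of (pos, wt_aa, mut_aa) tuples, 0-indexed; multi-site
                variants are scored additively (an independence assumption).
    """
    score = 0.0
    for pos, wt_aa, mut_aa in mutations:
        assert wt_seq[pos] == wt_aa, "wild-type residue mismatch"
        score += log_probs[pos, AA_INDEX[mut_aa]] - log_probs[pos, AA_INDEX[wt_aa]]
    return score

# Toy example: random logits stand in for real pLM output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 20))
# Normalize to log-probabilities (log-softmax over the amino-acid axis).
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

wt = "ACDEF"
single = log_odds_score(log_probs, wt, [(1, "C", "W")])   # one substitution
double = log_odds_score(log_probs, wt, [(1, "C", "W"), (4, "F", "A")])
```

A positive score means the model assigns the mutant residue higher probability than the wild type at that position, which is interpreted as a beneficial (or at least tolerated) substitution; variants are then ranked by this score against experimental assay values, e.g. via Spearman correlation as in ProteinGym.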