Evolutionary Profiles for Protein Fitness Prediction

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Predicting the fitness effects of protein mutations is hindered by the tension between sparse experimental assays and an astronomically large sequence space. The paper reinterprets masked language modeling (MLM) in protein language models as inverse reinforcement learning, in which natural evolutionary trajectories act as implicit reward signals and extant sequences serve as expert demonstrations. Building on this view, the proposed model, EvoIF, jointly incorporates within-family evolutionary profiles and cross-family structure-evolution constraints, introducing inverse-folding logit distillation and a compact transition block for sequence-structure co-representation. Evaluated on the ProteinGym benchmark, comprising 217 experimental assays and over 2.5 million variants, EvoIF achieves state-of-the-art or near-state-of-the-art performance while using only 0.15% of the training data and fewer parameters than competing methods, and shows improved few-shot generalization and cross-assay robustness.
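As a concrete illustration of the log-odds scoring referred to above, the sketch below scores a set of mutations with a generic protein language model. The function `plm_log_probs` is a hypothetical stand-in for any MLM-trained pLM that returns per-position amino-acid log-probabilities; it is not part of the paper's released code.

```python
# Minimal sketch of zero-shot log-odds mutation scoring with a protein
# language model. `plm_log_probs` is a hypothetical stand-in for any pLM
# that returns, for a sequence, an array of shape (L, 20) holding
# log P(amino acid | context) at each position.
AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def log_odds_score(seq: str, mutations: list[tuple[int, str, str]],
                   plm_log_probs) -> float:
    """Sum of log P(mutant aa) - log P(wild-type aa) over mutated positions.

    mutations: list of (0-based position, wild-type aa, mutant aa).
    """
    log_probs = plm_log_probs(seq)  # array of shape (len(seq), 20)
    score = 0.0
    for pos, wt, mt in mutations:
        assert seq[pos] == wt, "wild-type residue mismatch"
        score += log_probs[pos, AA_IDX[mt]] - log_probs[pos, AA_IDX[wt]]
    return score

# Example: score the single mutant A42G (0-based position 41).
# score = log_odds_score(wt_sequence, [(41, "A", "G")], plm_log_probs)
```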

📝 Abstract
Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available at https://github.com/aim-uofa/EvoIF.
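To make the within-family signal concrete, here is a minimal sketch (not the paper's implementation) of building a profile from retrieved homologs as column-wise amino-acid frequencies and blending it with pLM probabilities before log-odds scoring. The mixing weight `alpha` and the pseudocount are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a within-family evolutionary profile: per-column amino-acid
# frequencies over retrieved homologs (an MSA), blended with pLM
# probabilities before log-odds scoring. `alpha` and the pseudocount are
# illustrative choices only.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def msa_profile(msa: list[str], pseudocount: float = 1.0) -> np.ndarray:
    """Column-wise amino-acid frequencies of aligned homologs, shape (L, 20)."""
    L = len(msa[0])
    counts = np.full((L, 20), pseudocount)
    for seq in msa:
        for i, aa in enumerate(seq):
            if aa in AA_IDX:          # skip gaps and non-standard residues
                counts[i, AA_IDX[aa]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def blend_probs(plm_probs: np.ndarray, profile: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Convex combination of model probabilities and the family profile."""
    mixed = alpha * plm_probs + (1.0 - alpha) * profile
    return mixed / mixed.sum(axis=1, keepdims=True)

# Log-odds scoring would then use np.log(blend_probs(...)) in place of raw
# pLM log-probabilities.
```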
Problem

Research questions and friction points this paper is trying to address.

Predicting the fitness impact of protein mutations when assays are scarce relative to the size of sequence space
Integrating evolutionary signals from retrieved homologs (MSAs) with structure-derived constraints
Improving robustness across function types, MSA depths, taxa, and mutation depths
Innovation

Methods, ideas, or system contributions that make the work stand out.

EvoIF integrates within-family and cross-family evolutionary profiles
Fuses sequence-structure representations with these profiles via a compact transition block (see the sketch after this list)
Achieves state-of-the-art or competitive performance with only 0.15% of the training data and fewer parameters than recent large models
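The sketch below shows one plausible shape for a compact transition block that fuses per-residue sequence and structure embeddings into amino-acid probabilities. The embedding dimensions, hidden size, and layer choices are assumptions for illustration and do not reproduce EvoIF's reported architecture.

```python
# Minimal sketch of a compact "transition block" that fuses per-residue
# sequence and structure embeddings into amino-acid probabilities for
# log-odds scoring. Dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

class TransitionBlock(nn.Module):
    def __init__(self, seq_dim: int = 1280, struct_dim: int = 512,
                 hidden_dim: int = 256, n_amino_acids: int = 20):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.LayerNorm(seq_dim + struct_dim),
            nn.Linear(seq_dim + struct_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, n_amino_acids),
        )

    def forward(self, seq_emb: torch.Tensor, struct_emb: torch.Tensor) -> torch.Tensor:
        """seq_emb: (L, seq_dim), struct_emb: (L, struct_dim) -> (L, 20) probabilities."""
        fused = torch.cat([seq_emb, struct_emb], dim=-1)
        return torch.softmax(self.fuse(fused), dim=-1)

# Example usage with random embeddings for a 100-residue protein:
# block = TransitionBlock()
# probs = block(torch.randn(100, 1280), torch.randn(100, 512))
```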
Authors

Jigang Fan
Student Researcher at Stanford University and Princeton University; Student at Peking University
Machine Learning · AI for Science · Computational Biology

Xiaoran Jiao
Zhejiang University

Shengdong Lin
Zhejiang University, East China University of Science and Technology

Zhanming Liang
Zhejiang University, Chengdu University of Information Technology

Weian Mao
MIT CSAIL

Chenchen Jing
Zhejiang University of Technology

Hao Chen
Zhejiang University

Chunhua Shen
Zhejiang University
Computer Vision · Machine Learning