🤖 AI Summary
Protein function prediction and design face fundamental challenges under low-data regimes (low-N), where limited labeled sequences hinder model generalizability and interpretability.
Method: This work presents a systematic investigation of sparse autoencoders (SAEs) for decoding functional structure from the latent space of protein language models (specifically ESM2) under extreme data scarcity (e.g., only 24 labeled sequences). The authors fine-tune ESM2 to generate task-specific sequence embeddings and train SAEs on these embeddings to extract sparse, interpretable, and functionally relevant latent variables.
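To make the pipeline concrete, below is a minimal sketch of the SAE stage: a single-layer sparse autoencoder trained on mean-pooled ESM2 embeddings. The checkpoint name, latent width, L1 penalty weight, and use of the Hugging Face `transformers` API are illustrative assumptions, and the task-specific fine-tuning step is omitted; this is not the authors' exact architecture or training recipe.

```python
# Minimal sketch: a single-layer sparse autoencoder (SAE) over mean-pooled ESM2
# embeddings. Checkpoint, hidden width, and L1 weight are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse, non-negative latent activations
        x_hat = self.decoder(z)           # reconstruction in embedding space
        return x_hat, z

def embed(sequences, model_name="facebook/esm2_t12_35M_UR50D"):
    """Mean-pooled per-sequence ESM2 embeddings (fine-tuning omitted for brevity)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    plm = EsmModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        batch = tok(sequences, return_tensors="pt", padding=True)
        hidden = plm(**batch).last_hidden_state          # (B, L, d_model)
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(1) / mask.sum(1)      # (B, d_model)

def train_sae(embeddings, d_latent=4096, l1_weight=1e-3, epochs=200, lr=1e-3):
    """Fit the SAE with an MSE reconstruction loss plus an L1 sparsity penalty."""
    sae = SparseAutoencoder(embeddings.shape[1], d_latent)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        x_hat, z = sae(embeddings)
        loss = ((x_hat - embeddings) ** 2).mean() + l1_weight * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```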
Contribution/Results: SAEs substantially improve low-N functional prediction accuracy, matching or surpassing full ESM2 baselines across most tasks. Crucially, targeted manipulation of the identified latent variables yields top-fitness functional variants in 83% of cases compared to designing with ESM2 alone. The framework establishes an interpretable, sample-efficient paradigm for low-data protein engineering, bridging representation learning with actionable functional design.
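As an illustration of the low-N prediction setting, the sketch below fits a simple regression head on SAE latents from a handful of labeled variants (e.g., N = 24) and scores held-out sequences. Ridge regression and Spearman correlation are assumptions made for this sketch, not necessarily the predictive model or metric used in the paper.

```python
# Illustrative low-N fitness prediction: fit a ridge regressor on SAE latents
# from ~24 labeled sequences, then score held-out variants. The regressor and
# evaluation metric are assumptions for this sketch.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import spearmanr

def fit_low_n_predictor(z_train: np.ndarray, y_train: np.ndarray, alpha: float = 1.0):
    """z_train: (N, d_latent) SAE latents for N labeled variants (e.g., N = 24)."""
    return Ridge(alpha=alpha).fit(z_train, y_train)

def evaluate(model, z_test: np.ndarray, y_test: np.ndarray) -> float:
    """Spearman correlation between predicted and measured fitness."""
    return spearmanr(model.predict(z_test), y_test).correlation
```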
📝 Abstract
Predicting protein function from amino acid sequence remains a central challenge in data-scarce (low-$N$) regimes, limiting machine learning-guided protein design when only small amounts of assay-labeled sequence-function data are available. Protein language models (pLMs) have advanced the field by providing evolutionarily informed embeddings, and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low-$N$ function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction, indicating that their sparse latent space encodes compact and biologically meaningful representations that generalize more effectively from limited data. Moreover, steering predictive latents exploits biological motifs in pLM representations, yielding top-fitness variants in 83% of cases compared to designing with ESM2 alone.
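The sketch below shows one way latent steering could be realized, reusing the `SparseAutoencoder` from the earlier sketch: amplify a predictive latent coordinate, decode back to embedding space, and rank candidate variants against the steered embedding. The steering magnitude and the cosine-similarity ranking are illustrative assumptions rather than the paper's exact design procedure.

```python
# Sketch of latent steering: amplify one predictive SAE latent and decode back
# into ESM2 embedding space. How the steered embedding guides variant selection
# (here, cosine-similarity ranking) is an assumption for this sketch.
import torch

def steer_latent(sae, x: torch.Tensor, latent_idx: int, delta: float = 5.0) -> torch.Tensor:
    """Return an embedding whose chosen SAE latent has been shifted by `delta`."""
    with torch.no_grad():
        z = torch.relu(sae.encoder(x))   # encode embedding into sparse latents
        z[:, latent_idx] += delta        # steer the predictive latent upward
        return sae.decoder(z)            # decode back to embedding space

def rank_variants_by_steered_similarity(sae, variant_embs, wild_type_emb, latent_idx):
    """Rank candidate variants by cosine similarity to the steered wild-type embedding."""
    target = steer_latent(sae, wild_type_emb.unsqueeze(0), latent_idx).squeeze(0)
    sims = torch.nn.functional.cosine_similarity(variant_embs, target.unsqueeze(0), dim=1)
    return torch.argsort(sims, descending=True)
```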