Reverse Distillation: Consistently Scaling Protein Language Model Representations

πŸ“… 2026-03-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the performance saturation or degradation often observed when scaling up protein language models. To overcome this limitation, the authors propose a reverse distillation framework that decomposes embeddings from a large model into orthogonal subspaces guided by smaller models from the same family, thereby constructing a nested Matryoshka embedding structure. In this architecture, representations from smaller models serve as prefix subspaces of the larger model’s embeddings, effectively disentangling general-purpose features from incremental information. This approach achieves, for the first time, consistent performance gains with increasing model scale. Evaluated on the ProteinGym benchmark, reverse-distilled variants of ESM-2 consistently outperform baseline models at equivalent embedding dimensions, with the 15-billion-parameter model achieving state-of-the-art performance.
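The nested structure described in the summary can be pictured with a short sketch. This is a minimal, hypothetical module: the names (`ReverseDistilledEmbedding`, `align`, `extra`) and the linear residual decomposition are illustrative assumptions, not the paper's exact architecture or training objective.

```python
import torch
import torch.nn as nn


class ReverseDistilledEmbedding(nn.Module):
    """Nested (Matryoshka-style) embedding sketch: the first `small_dim`
    dimensions are the small model's representation verbatim; the remaining
    dimensions compress the part of the large model's embedding that the
    small model does not already explain. Illustrative design only."""

    def __init__(self, small_dim: int, large_dim: int, total_dim: int):
        super().__init__()
        assert total_dim > small_dim
        # Predicts the large embedding from the small one; its residual stands
        # in for the large model's "incremental" information (assumed design).
        self.align = nn.Linear(small_dim, large_dim, bias=False)
        # Maps that residual into the extra embedding dimensions.
        self.extra = nn.Linear(large_dim, total_dim - small_dim, bias=False)

    def forward(self, small_emb: torch.Tensor, large_emb: torch.Tensor) -> torch.Tensor:
        explained = self.align(small_emb)   # component shared with the small model
        residual = large_emb - explained    # roughly orthogonal, large-model-specific signal
        # Prefix = small model's subspace, suffix = compressed incremental subspace.
        return torch.cat([small_emb, self.extra(residual)], dim=-1)
```

Truncating such an embedding to its first `small_dim` dimensions recovers the small model's representation exactly, which is the property that lets larger reverse-distilled models be compared against smaller ones at any prefix length.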

πŸ“ Abstract
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse-distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
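As a sketch of how comparisons "at the same embedding dimensionality" could be run on a ProteinGym-style substitution task: truncate the nested embedding to a prefix of length k, fit a simple regressor, and score Spearman correlation on held-out variants. The helper name `score_at_dim`, the ridge head, and the 80/20 split are assumptions for illustration; the paper's actual downstream protocol is not described here.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def score_at_dim(embeddings: np.ndarray, fitness: np.ndarray, k: int, seed: int = 0) -> float:
    """Fit a ridge regressor on the first k embedding dimensions and report
    Spearman correlation on held-out variants (a common ProteinGym-style metric)."""
    X = embeddings[:, :k]  # nested structure: first k dims = smaller model's subspace
    X_tr, X_te, y_tr, y_te = train_test_split(X, fitness, test_size=0.2, random_state=seed)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    rho, _ = spearmanr(model.predict(X_te), y_te)
    return rho


# Usage with placeholder data (real runs would use per-variant PLM embeddings
# and experimental fitness labels from a ProteinGym assay):
# rng = np.random.default_rng(0)
# emb = rng.normal(size=(500, 1280))
# fit = rng.normal(size=500)
# print(score_at_dim(emb, fit, k=640), score_at_dim(emb, fit, k=1280))
```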
Problem

Research questions and friction points this paper is trying to address.

protein language models
scaling laws
model performance
representation scaling
embedding dimensionality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Distillation
Protein Language Models
Matryoshka Embeddings
Orthogonal Subspaces
Model Scaling
Darius Catrina
Department of Computer Science, Duke University, Durham, NC, USA
Christian Bepler
Department of Computer Science, Duke University, Durham, NC, USA
Samuel Sledzieski
Research Fellow at Flatiron Institute CCB
bioinformatics, computational biology, protein interaction, machine learning, biological networks
Rohit Singh
Duke University
computational biology, genomics, network analysis