🤖 AI Summary
Existing discrete diffusion-based protein language models rely on masking, which does not reflect how proteins actually evolve: through gradual substitutions, insertions, and deletions. This limits their capacity for flexible guided generation and editing. This work proposes DPLM-Evo, the first method to explicitly model evolutionary editing operations within a diffusion framework. It introduces a contextualized evolutionary noising kernel that generates biologically plausible mutations, and it decouples variable-length observed sequences from an upsampled latent alignment space, enabling efficient variable-length generation. By jointly predicting substitutions, insertions, and deletions, DPLM-Evo achieves state-of-the-art mutation effect prediction in the single-sequence setting of ProteinGym and supports post-hoc protein editing and simulated evolution through explicit edit trajectories.
📝 Abstract
Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequences, and discrete diffusion-based protein language models (e.g., DPLMs) are promising for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion, which contradicts a simple biological intuition: proteins evolve through accumulated edits, not by emerging from masks. Consequently, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, limiting both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM-Evo decouples an upsampled-length latent alignment space from the variable-length observed sequence space, which makes indel-aware generation tractable and enables adaptive scaffold growth throughout the process with negligible computational overhead. To better align substitutions with real evolution, we further introduce a contextualized evolutionary noising kernel that produces biologically informed, context-dependent mutation patterns. Across tasks, DPLM-Evo improves sequence understanding and achieves state-of-the-art mutation effect prediction performance on ProteinGym in the single-sequence setting. It also enables variable-length simulated evolution and post-editing or optimization of existing proteins via explicit edit trajectories.
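The abstract does not say how the latent alignment space is represented, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a fixed-length latent grid in which empty slots hold a gap token, so that substitutions, insertions, and deletions all become in-place changes to single slots while the observed sequence changes length. The gap symbol, the upsampling factor, and the function names `upsample`, `collapse`, and `apply_edit` are hypothetical and chosen for illustration.

```python
from typing import List, Optional

GAP = "-"  # hypothetical gap token marking an empty latent slot


def upsample(seq: str, factor: int = 2) -> List[str]:
    """Place each residue on a longer latent grid, followed by (factor - 1) gaps."""
    latent: List[str] = []
    for residue in seq:
        latent.append(residue)
        latent.extend([GAP] * (factor - 1))
    return latent


def collapse(latent: List[str]) -> str:
    """Recover the variable-length observed sequence by dropping gap slots."""
    return "".join(tok for tok in latent if tok != GAP)


def apply_edit(latent: List[str], op: str, pos: int,
               residue: Optional[str] = None) -> List[str]:
    """Apply one edit to a single latent slot; the latent length never changes."""
    out = list(latent)
    if op == "sub":    # substitution: occupied slot -> different residue
        if out[pos] == GAP or residue is None:
            raise ValueError("substitution needs an occupied slot and a residue")
        out[pos] = residue
    elif op == "ins":  # insertion: empty slot -> residue
        if out[pos] != GAP or residue is None:
            raise ValueError("insertion needs an empty slot and a residue")
        out[pos] = residue
    elif op == "del":  # deletion: occupied slot -> gap
        if out[pos] == GAP:
            raise ValueError("deletion needs an occupied slot")
        out[pos] = GAP
    else:
        raise ValueError(f"unknown edit operation: {op}")
    return out


# Toy edit trajectory on a three-residue sequence (illustrative data only).
latent = upsample("MKV", factor=2)
trajectory = [collapse(latent)]
for op, pos, residue in [("sub", 2, "R"), ("ins", 1, "A"), ("del", 4, None)]:
    latent = apply_edit(latent, op, pos, residue)
    trajectory.append(collapse(latent))

print(" -> ".join(trajectory))
```

In this toy setup the latent grid keeps six slots throughout, while the observed sequence grows from three residues to four and then shrinks back to three. That is the sense in which a fixed-size latent space can make indel-aware generation tractable under these assumptions.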