Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing protein language models for antibody design largely memorize germline sequences rather than modeling somatic variation, and they offer limited support for classifier-guided conditional generation. This work proposes a discrete diffusion model that treats the germline sequence, rather than a masked sequence, as the absorbing state of the noise process. The model therefore only learns the trajectory from germline to observed sequence, which keeps V(D)J recombination statistics out of the learned distribution and substantially reduces germline bias. Germline-absorbing diffusion raises non-germline residue prediction accuracy from 26% to 46%. With classifier-guided sampling, the model also significantly outperforms EvoProtGrad on conditional generation toward improved hydrophobicity and predicted binding affinity, achieving a better trade-off between sample quality and adherence to the specified condition.
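The 26% to 46% figure refers to accuracy on non-germline residues. As a rough illustration only, the sketch below scores predictions at positions where the observed sequence differs from its germline. The function name `non_germline_accuracy`, the assumption that all three sequences are pre-aligned to equal length, and the toy sequences are hypothetical; the paper's exact metric definition is not given here and may differ.

```python
def non_germline_accuracy(predicted, observed, germline):
    """Fraction of mutated positions (observed != germline) predicted correctly.

    Assumes the three sequences are already aligned to the same length.
    Returns None when the observed sequence has no non-germline residues.
    """
    if not (len(predicted) == len(observed) == len(germline)):
        raise ValueError("sequences must be aligned to the same length")
    # Positions where somatic variation moved the sequence away from germline.
    mutated = [i for i, (o, g) in enumerate(zip(observed, germline)) if o != g]
    if not mutated:
        return None
    correct = sum(predicted[i] == observed[i] for i in mutated)
    return correct / len(mutated)


# Toy, made-up sequences: the observed sequence has three non-germline positions.
observed = "QVQLVESGG"
germline = "EVQLVQSGA"
predicted = "QVQLVQSGA"  # recovers one of the three mutated residues
score = non_germline_accuracy(predicted, observed, germline)
```

Under this definition, a model that simply copies the germline scores 0 on every sequence, which is why the metric is sensitive to germline bias.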
📝 Abstract
Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for antibody sequence design, existing approaches largely suffer from two key limitations: they predominantly memorize germline sequences rather than modeling biologically meaningful somatic variation, and they offer limited support for flexible classifier-guided conditional generation. We address these challenges through two primary contributions. First, we demonstrate that discrete diffusion fine-tuning achieves strong language modeling performance on antibody sequences while allowing for generation conditioned on any off-the-shelf classifier. Second, we introduce germline absorbing diffusion, a novel modification of the discrete diffusion noise process in which the germline sequence, rather than a masked sequence, serves as the absorbing state. This biologically motivated inductive bias restricts the model to learning the trajectory from germline to observed sequence, effectively excluding genetic variation and V(D)J recombination statistics from the learned distribution and dramatically mitigating germline bias. We show that germline diffusion improves non-germline residue prediction accuracy from 26 percent to 46 percent, approaching the theoretical upper bound set by true biological variability. We then demonstrate the utility of our germline diffusion model on the conditional generation tasks of sampling antibodies with improved hydrophobicity and predicted binding affinity. On both tasks, our model shows an improved trade-off between class adherence and sample quality, significantly outperforming EvoProtGrad, a popular strategy to sample from pLMs with gradient-based discrete Markov chain Monte Carlo.
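To make the germline-absorbing noise process concrete, here is a minimal sketch of a forward corruption step in which each position reverts to its germline residue instead of a mask token. The function name `germline_absorb`, the linear schedule (absorption probability equal to `t`), the independent per-position sampling, and the toy sequences are all assumptions for illustration, not the authors' implementation.

```python
import random


def germline_absorb(observed, germline, t, rng):
    """Sample a noised sequence at time t in [0, 1].

    Each position is independently replaced by its germline residue with
    probability t (an assumed linear schedule). At t = 0 the observed
    sequence is returned unchanged; at t = 1 the result is the germline.
    """
    if len(observed) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return "".join(
        g if rng.random() < t else o for o, g in zip(observed, germline)
    )


# Toy, made-up aligned sequences.
rng = random.Random(0)
observed = "QVQLVESGG"
germline = "EVQLVQSGA"
clean = germline_absorb(observed, germline, 0.0, rng)
absorbed = germline_absorb(observed, germline, 1.0, rng)
partial = germline_absorb(observed, germline, 0.5, rng)
```

Positions where the observed residue already equals the germline residue never change under this process, so a denoiser trained to reverse it only has to model the somatic mutations, which matches the stated goal of restricting learning to the germline-to-observed trajectory.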
Problem

Research questions and friction points this paper is trying to address.

antibody sequence generation
germline bias
conditional generation
somatic variation
developability properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

germline-absorbing diffusion
discrete diffusion
classifier-guided generation
antibody sequence design
protein language models