🤖 AI Summary
Current antibody affinity evaluation methods typically analyze antibody sequences or structures in isolation, lacking a unified benchmark that treats the antibody–antigen (Ab–Ag) complex as the functional unit and reflects true binding capability. To address this, we propose AbBiBench—the first function-oriented evaluation framework grounded in complex likelihood estimation, breaking from conventional single-antibody assessment paradigms. AbBiBench integrates masked language modeling, autoregressive generation, inverse folding, diffusion-based structure generation, and geometric graph neural networks, jointly scoring candidates on experimental affinity, structural integrity, and biophysical properties. We systematically evaluate 14 state-of-the-art models on a benchmark comprising 9 antigens and approximately 156,000 antibody variants. Results show that structure-conditioned inverse folding models achieve top performance. In an H1N1 antibody design case study, AbBiBench demonstrates strong predictive validity: model-derived complex likelihood correlates significantly with experimental dissociation constants (K<sub>D</sub>; Pearson *r* = 0.72).
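The scoring idea above—treating the Ab–Ag complex as the unit and using a protein model's likelihood on it—can be sketched as a pseudo-log-likelihood sum over complex positions. The sketch below is illustrative only: in practice the per-position log-probabilities would come from one of the benchmarked protein models (e.g., a masked language model run on the full complex), while here a hypothetical toy table stands in for the model.

```python
import math

def pseudo_log_likelihood(complex_seq, per_position_log_probs):
    """Sum of log P(residue_i | rest of complex) over all positions.

    In a complex-likelihood setup, per_position_log_probs would be produced
    by a protein model conditioned on the whole Ab-Ag complex; here it is a
    hypothetical precomputed table used purely for illustration.
    """
    return sum(per_position_log_probs[i][aa] for i, aa in enumerate(complex_seq))

# Toy two-residue "complex" with made-up model log-probabilities.
toy_seq = "AC"
toy_log_probs = [
    {"A": math.log(0.9), "C": math.log(0.1)},  # position 0: model favors A
    {"A": math.log(0.1), "C": math.log(0.9)},  # position 1: model favors C
]
score = pseudo_log_likelihood(toy_seq, toy_log_probs)
```

A candidate antibody variant would then be ranked by this score computed on its mutated complex: higher total log-likelihood is taken as evidence of a more plausible, better-binding complex.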
📝 Abstract
We introduce AbBiBench (Antibody Binding Benchmarking), a benchmarking framework for antibody binding affinity maturation and design. Unlike existing antibody evaluation strategies that rely on the antibody alone and its similarity to natural ones (e.g., amino acid identity rate, structural RMSD), AbBiBench considers an antibody-antigen (Ab-Ag) complex as a functional unit and evaluates the potential of an antibody design to bind a given antigen by measuring a protein model's likelihood on the Ab-Ag complex. We first curate, standardize, and share 9 datasets containing 9 antigens (spanning influenza, lysozyme, HER2, VEGF, integrin, and SARS-CoV-2) and 155,853 heavy-chain-mutated antibodies. Using these datasets, we systematically compare 14 protein models including masked language models, autoregressive language models, inverse folding models, diffusion-based generative models, and geometric graph models. The correlation between model likelihood and experimental affinity values is used to evaluate model performance. Additionally, in a case study aimed at increasing the binding affinity of antibody F045-092 to the influenza H1N1 antigen, we evaluate the generative power of the top-performing models by sampling a set of new antibodies binding to the antigen and ranking them by the structural integrity and biophysical properties of the Ab-Ag complex. As a result, structure-conditioned inverse folding models outperform others in both affinity correlation and generation tasks. Overall, AbBiBench provides a unified, biologically grounded evaluation framework to facilitate the development of more effective, function-aware antibody design models.
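The core evaluation signal described above is the correlation between model likelihoods and experimental affinities. A minimal sketch, assuming hypothetical per-variant complex log-likelihoods and affinities expressed as pK<sub>D</sub> = −log<sub>10</sub>(K<sub>D</sub>) (so larger means tighter binding; the numbers below are invented for illustration):

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical complex log-likelihoods from a protein model for five
# antibody variants, paired with their (invented) measured pKD values.
log_likelihoods = [-120.5, -118.2, -115.9, -114.1, -110.0]
pkd = [7.1, 7.6, 8.0, 8.4, 9.0]

r = pearson_r(log_likelihoods, pkd)  # near +1: likelihood tracks affinity
```

A benchmark of this kind would compute such a correlation per antigen dataset and per model, and rank models by it; in practice a rank correlation (Spearman) is often reported alongside Pearson to guard against non-linear likelihood scales.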