Kernel-Based Evaluation of Conditional Biological Sequence Models

📅 2025-10-17
🤖 AI Summary
This study addresses the lack of reliable evaluation metrics for conditional sequence models in computational biology, such as ProteinMPNN. It proposes the Augmented Conditional Maximum Mean Discrepancy (ACMMD), a kernel-based measure of the discrepancy between the true conditional distribution and the model's estimate. Provided the model can be sampled from, the ACMMD admits an unbiased estimator and can be integrated within statistical hypothesis tests. The resulting tools support model evaluation and the tuning of hyperparameters such as sampling temperature. In experiments, the hypothesis that ProteinMPNN fits its data is rejected for various protein families, and tuning the temperature with the ACMMD yields a better fit.

📝 Abstract
We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
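
The estimation and testing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the discrepancy is computed as an MMD between joint (condition, sequence) pairs under a product kernel, which this page does not spell out. All names (`hamming_kernel`, `gaussian_kernel`, `acmmd_core`, `ustat`, `wild_bootstrap_pvalue`, `toy_data`), the kernel choices, and the toy data are hypothetical. The wild bootstrap is swapped in here as one standard way to calibrate a test built on a U-statistic; the paper's own test may differ.

```python
import numpy as np


def hamming_kernel(seqs_a, seqs_b, lam=1.0):
    """exp(-lam * normalized Hamming distance) for equal-length strings."""
    a = np.array([list(s) for s in seqs_a])
    b = np.array([list(s) for s in seqs_b])
    dist = (a[:, None, :] != b[None, :, :]).mean(axis=2)
    return np.exp(-lam * dist)


def gaussian_kernel(xa, xb, bandwidth=1.0):
    """Gaussian kernel on numeric condition features of shape (n, d)."""
    sq = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def acmmd_core(x, y_true, y_model):
    """Pairwise core h[i, j] built from triples (x_i, y_i, y_model_i).

    Assumed form: k_X(x_i, x_j) * [k(y_i, y_j) + k(m_i, m_j)
                                   - k(y_i, m_j) - k(m_i, y_j)],
    where m_i is a model sample drawn for the same condition x_i.
    """
    kx = gaussian_kernel(x, x)
    k_tt = hamming_kernel(y_true, y_true)
    k_mm = hamming_kernel(y_model, y_model)
    k_tm = hamming_kernel(y_true, y_model)
    h = kx * (k_tt + k_mm - k_tm - k_tm.T)
    np.fill_diagonal(h, 0.0)  # a U-statistic excludes the i == j terms
    return h


def ustat(h):
    """Unbiased U-statistic: average of h over all pairs with i != j."""
    n = h.shape[0]
    return h.sum() / (n * (n - 1))


def wild_bootstrap_pvalue(h, n_boot=200, seed=0):
    """Wild bootstrap p-value using random sign flips w_i * w_j * h[i, j]."""
    rng = np.random.default_rng(seed)
    observed = ustat(h)
    n = h.shape[0]
    exceed = 0
    for _ in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)
        boot = (w[:, None] * h * w[None, :]).sum() / (n * (n - 1))
        exceed += int(boot >= observed)
    return (1 + exceed) / (1 + n_boot)


def toy_data(n=200, length=6, seed=0):
    """Toy setup: true sequences depend on the condition, the model ignores it."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    x = labels[:, None].astype(float)
    y_true = ["A" * length if c == 0 else "C" * length for c in labels]
    y_model = ["".join(rng.choice(list("ACDE"), size=length)) for _ in range(n)]
    return x, y_true, y_model


x, y_true, y_model = toy_data()
h = acmmd_core(x, y_true, y_model)
print("squared discrepancy estimate:", ustat(h))
print("wild bootstrap p-value:", wild_bootstrap_pvalue(h))
```

In this toy setup the model ignores the condition entirely, so the estimate should be clearly positive and the test should reject. Because the statistic is a U-statistic, it can take small negative values when the model fits well; that is expected for an unbiased estimator of a non-negative quantity.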
Problem

Research questions and friction points this paper is trying to address.

Evaluating conditional biological sequence models using kernel-based methods
Measuring discrepancy between true and estimated conditional distributions
Testing model fit and tuning hyperparameters for protein design models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel-based tools evaluate conditional sequence models
Augmented Conditional Maximum Mean Discrepancy measures distribution discrepancy
Method enables model testing and hyperparameter tuning
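
The hyperparameter tuning listed above can be sketched as a grid search that picks the sampling temperature with the smallest estimated discrepancy. This is an illustration under stated assumptions, not the paper's procedure: `sample_sequences` is a hypothetical toy sampler standing in for a real model such as ProteinMPNN, the discrepancy is assumed to be a joint-pair MMD with a product kernel, and the names `seq_kernel`, `acmmd_sq_estimate`, and `select_temperature` are invented for this example.

```python
import numpy as np

ALPHABET_SIZE = 4


def sample_sequences(rng, x, temperature, length=8):
    """Hypothetical toy sampler: condition x_i in {0, 1} favors letter x_i.

    Positions are sampled independently from softmax(logits / temperature).
    """
    logits = np.zeros((len(x), ALPHABET_SIZE))
    logits[np.arange(len(x)), x] = 2.0
    probs = np.exp(logits / temperature)
    probs /= probs.sum(axis=1, keepdims=True)
    cdf = probs.cumsum(axis=1)
    u = rng.random((len(x), length))
    idx = (u[:, :, None] > cdf[:, None, :]).sum(axis=2)  # inverse-CDF sampling
    return np.minimum(idx, ALPHABET_SIZE - 1)  # guard against cdf rounding


def seq_kernel(a, b, lam=4.0):
    """exp(-lam * normalized Hamming distance) on integer-encoded sequences."""
    dist = (a[:, None, :] != b[None, :, :]).mean(axis=2)
    return np.exp(-lam * dist)


def acmmd_sq_estimate(x, y_true, y_model):
    """U-statistic estimate of the assumed joint-pair squared discrepancy."""
    n = len(x)
    diff = (x[:, None] - x[None, :]).astype(float)
    kx = np.exp(-0.5 * diff ** 2)  # Gaussian kernel on the scalar condition
    k_tm = seq_kernel(y_true, y_model)
    h = kx * (seq_kernel(y_true, y_true) + seq_kernel(y_model, y_model)
              - k_tm - k_tm.T)
    np.fill_diagonal(h, 0.0)
    return h.sum() / (n * (n - 1))


def select_temperature(x, y_true, temperatures, seed=0):
    """Return the temperature with the smallest estimate, plus all scores."""
    rng = np.random.default_rng(seed)
    scores = {}
    for t in temperatures:
        y_model = sample_sequences(rng, x, t)  # one model sample per condition
        scores[t] = acmmd_sq_estimate(x, y_true, y_model)
    return min(scores, key=scores.get), scores


# Toy "ground truth": data generated by the same sampler at temperature 1.0.
data_rng = np.random.default_rng(1)
x = data_rng.integers(0, 2, size=400)
y_true = sample_sequences(data_rng, x, temperature=1.0)
best, scores = select_temperature(x, y_true, [0.2, 1.0, 5.0])
print("selected temperature:", best)
print("scores:", scores)
```

Since the toy data are generated at temperature 1.0, the grid search should recover that value. With a real model, the true conditional distribution is unknown and the selected temperature only minimizes the estimated discrepancy on the evaluation set; it does not guarantee that the model fits.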
Authors

Pierre Glaser
Gatsby Computational Neuroscience Unit, UCL
Machine Learning

Steffanie Paul
Systems Biology, Harvard Medical School, Boston, USA

Alissa M. Hummer
Stanford University
Machine Learning, Molecular Design, Computational Structural Biology

Charlotte M. Deane
Department of Statistics, University of Oxford, Oxford, UK

Debora S. Marks
Harvard Medical School, Broad Institute, Boston, USA

Alan N. Amin
Faculty Fellow at New York University