🤖 AI Summary
This study addresses the lack of reliable evaluation metrics for conditional sequence models in computational biology, such as ProteinMPNN. It proposes the Augmented Conditional Maximum Mean Discrepancy (ACMMD), a kernel-based measure of the discrepancy between the true conditional distribution and the model's estimate. Provided the model can be sampled from, the ACMMD admits unbiased estimation from data and can be used in statistical hypothesis tests. It supports model evaluation and quantitative tuning of hyperparameters such as sampling temperature. In experiments, the hypothesis that ProteinMPNN fits its data is rejected for various protein families, and tuning the temperature under the ACMMD yields a better fit. The framework thus offers a statistically grounded way to evaluate conditional generative models in protein design.
📝 Abstract
We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD admits an unbiased estimate from data, which can be used to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We reject the hypothesis that ProteinMPNN fits its data for various protein families, and we tune the model's temperature hyperparameter to achieve a better fit.
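To make the idea of an unbiased, sample-based discrepancy concrete, here is a minimal sketch in Python. It rests on assumptions that the abstract does not spell out: that the discrepancy can be treated as an MMD between the joint distribution of (input, true output) and that of (input, model sample), with a product kernel over inputs and outputs, and that sequences and conditioning inputs have already been embedded as real vectors so a Gaussian kernel applies. The function names `gaussian_gram` and `augmented_mmd_u_stat`, the bandwidth parameters, and the choice of kernel are all hypothetical and may differ from the authors' implementation.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian kernel Gram matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def augmented_mmd_u_stat(x, y, y_model, bandwidth_x=1.0, bandwidth_y=1.0):
    """U-statistic estimate of a squared MMD between (x, y) and (x, y_model).

    x       : (n, d_x) conditioning inputs (assumed vector embeddings)
    y       : (n, d_y) observed outputs, one per input
    y_model : (n, d_y) model samples, one per input
    Assumption: product kernel k_x * k_y on the joint space.
    """
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two samples for a U-statistic")
    k_x = gaussian_gram(x, x, bandwidth_x)
    k_yy = gaussian_gram(y, y, bandwidth_y)
    k_mm = gaussian_gram(y_model, y_model, bandwidth_y)
    k_ym = gaussian_gram(y, y_model, bandwidth_y)
    # Pairwise core: k_x(i, j) * (k(y_i, y_j) + k(m_i, m_j) - k(y_i, m_j) - k(m_i, y_j))
    h = k_x * (k_yy + k_mm - k_ym - k_ym.T)
    # Dropping the diagonal (i == j) terms is what makes the estimate unbiased.
    np.fill_diagonal(h, 0.0)
    return h.sum() / (n * (n - 1))
```

Under these assumptions, a value near zero is consistent with the model matching the data, while a clearly positive value suggests a mismatch. The same statistic could in principle be recomputed over a grid of sampling temperatures to pick the one with the smallest estimate, in the spirit of the temperature tuning described above.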