🤖 AI Summary
This study addresses the challenge of integrating heterogeneous evidence sources for missense variant pathogenicity prediction by proposing AnnotateMissense, a scalable framework for annotation, benchmarking and genome-wide prediction. It combines AlphaMissense scores and ESM-derived features with evolutionary conservation, population frequency, transcript context, established pathogenicity predictors and engineered amino acid/codon-context features in a single 303-feature benchmark set. In stratified five-fold cross-validation on 132,714 ClinVar-labelled variants, the XGBoost model reaches a mean MCC of 0.9411 and a ROC-AUC of 0.9950; on temporally held-out ClinVar variants, it attains an MCC of 0.7613, an accuracy of 0.8798 and an F1-score of 0.8750. The final model provides pathogenicity scores and binary prediction labels for 90,643,830 hg38 missense variants, and the code and outputs are publicly available.
📝 Abstract
Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
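The headline metric above is the mean Matthews correlation coefficient (MCC) across stratified five-fold cross-validation. As a rough illustration of what that evaluation protocol involves, the following is a minimal pure-Python sketch, not the authors' pipeline. The function names (`mcc`, `stratified_folds`, `cross_validated_mcc`), the toy scores and the threshold classifier are all hypothetical stand-ins introduced here; the paper's benchmark uses XGBoost on the 303-feature set.

```python
import math
import random
from collections import defaultdict


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = pathogenic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal is empty and MCC is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def stratified_folds(labels, k=5, seed=0):
    """Assign every index to one of k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives a share of it.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validated_mcc(features, labels, fit, predict, k=5, seed=0):
    """Mean MCC over k stratified folds for any fit/predict pair."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        preds = predict(model, [features[i] for i in held_out])
        scores.append(mcc([labels[i] for i in held_out], preds))
    return sum(scores) / len(scores)


# Toy data (invented for illustration): one hypothetical score per variant,
# where a higher score means "more likely pathogenic".
toy_scores = [[0.1], [0.2], [0.15], [0.3], [0.25], [0.9], [0.8], [0.85], [0.7], [0.95]]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def fit_threshold(X, y):
    # Midpoint between the class means; a stand-in for a real learner such as XGBoost.
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, X):
    return [1 if x[0] > threshold else 0 for x in X]


mean_mcc = cross_validated_mcc(toy_scores, toy_labels, fit_threshold, predict_threshold)
print(f"mean MCC across 5 folds: {mean_mcc:.4f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when pathogenic and benign classes are imbalanced, and stratifying the folds keeps the class ratio similar in every held-out split. In practice the threshold model would be replaced by the actual classifier, and library routines such as scikit-learn's `StratifiedKFold` and `matthews_corrcoef` would typically be used instead of these hand-written helpers.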