AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction

📅 2026-05-23

🤖 AI Summary

This study addresses the challenge of integrating heterogeneous evidence sources for missense variant pathogenicity prediction by proposing a scalable framework for genome-wide annotation, benchmarking and prediction. It combines protein-language-model-derived features (AlphaMissense scores and ESM-derived features) with evolutionary conservation, population frequency, transcript context, established pathogenicity predictors and engineered amino acid/codon-context features in a single 303-feature benchmark set. In stratified five-fold cross-validation on 132,714 ClinVar-labelled variants, XGBoost on the full feature set achieves a mean MCC of 0.9411 and ROC-AUC of 0.9950; on temporally held-out ClinVar variants, the model attains an MCC of 0.7613, accuracy of 0.8798 and F1-score of 0.8750. Circularity-controlled ablations show that removing prior-predictor, population-frequency and clinically overlapping evidence reduces performance, whereas removing AlphaMissense and ESM-derived features alone has minimal effect. The project releases code and outputs, including pathogenicity scores and binary predictions for 90,643,830 hg38 missense variants.

📝 Abstract

Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
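
The evaluation protocol described above (stratified five-fold cross-validation scored with MCC, accuracy and F1-score) can be illustrated with a minimal, dependency-free sketch. This is not the authors' pipeline: the function names, the toy labels and the label coding (1 = pathogenic, 0 = benign) are assumptions made for illustration, and the fold assignment is a simple round-robin scheme rather than whatever splitter the released code uses.

```python
import math
import random
from collections import defaultdict


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), assuming 1 = pathogenic and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal shuffled indices of each class round-robin across the k folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of


# Toy labels and predictions (hypothetical, not ClinVar data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print("MCC:", mcc(*counts))
print("accuracy:", accuracy(*counts))
print("F1:", f1_score(*counts))
```

In a cross-validation setting, a model would be trained on four folds and scored on the fifth, and the per-fold MCC values averaged to obtain a mean MCC like the one reported. MCC is a natural headline metric here because it uses all four confusion-matrix cells and so stays informative when pathogenic and benign classes are imbalanced.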

Problem

Research questions and friction points this paper is trying to address.

missense variant

pathogenicity prediction

variant interpretation

genomic annotation

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

missense variant prediction

genome-wide annotation

benchmarking framework