AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction

📅 2026-05-23
🤖 AI Summary
This study addresses the challenge of integrating heterogeneous evidence sources for missense variant pathogenicity prediction by proposing a scalable, genome-wide annotation and prediction framework. It combines protein-language-model-derived features (AlphaMissense and ESM) with evolutionary conservation, population frequency, transcript context, established pathogenicity predictors, and engineered amino acid/codon-context annotations in a single 303-feature benchmark set. In stratified five-fold cross-validation on 132,714 ClinVar-labelled variants, the XGBoost-based model achieves a mean MCC of 0.9411 and ROC-AUC of 0.9950; on temporally held-out ClinVar variants, it attains an MCC of 0.7613, an accuracy of 0.8798, and an F1-score of 0.8750. Ablations indicate that performance drops when prior-predictor, population-frequency, and clinically overlapping evidence is removed, whereas excluding AlphaMissense and ESM-derived features alone has minimal effect. The project releases its code and outputs, including pathogenicity scores and binary prediction labels for 90,643,830 hg38 missense variants.
📝 Abstract
Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
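The abstract reports MCC, accuracy and F1-score for binary pathogenic/benign predictions. As a reading aid, the following is a minimal sketch of how these three metrics follow from confusion-matrix counts. It is not the authors' evaluation code: the function names and the toy labels are hypothetical, and the encoding 1 = pathogenic, 0 = benign is an assumption made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = pathogenic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return 0.0 if total == 0 else (tp + tn) / total


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print("MCC:", mcc(tp, tn, fp, fn))
    print("Accuracy:", accuracy(tp, tn, fp, fn))
    print("F1:", f1_score(tp, fp, fn))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it is less easily inflated by class imbalance.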
Problem

Research questions and friction points this paper is trying to address.

missense variant
pathogenicity prediction
variant interpretation
genomic annotation
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

missense variant prediction
genome-wide annotation
benchmarking framework
protein language models
XGBoost
Muhammad Muneeb
Unknown affiliation
David B. Ascher
1 School of Chemistry and Molecular Biology, The University of Queensland, Queen Street, 4067, Queensland, Australia; 2 Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Commercial Road, 3004, Victoria, Australia