GenoBERT: A Language Model for Accurate Genotype Imputation

📅 2026-03-31
🤖 AI Summary
This work addresses two limitations of conventional reference-panel-based genotype imputation: ancestry bias and reduced accuracy for rare variants. The authors propose GenoBERT, a Transformer-based framework that imputes genotypes without an external reference panel. It tokenizes phased genotypes and applies bidirectional self-attention over a 128-SNP context window (approximately 100 Kb) to capture both short- and long-range linkage disequilibrium (LD). Evaluated on the Louisiana Osteoporosis Study (LOS) and 1000 Genomes Project (1KGP) datasets at 5-50% missingness, GenoBERT reaches r² ≈ 0.98 at up to 25% missingness and stays above r² = 0.90 at 50%, the highest overall accuracy among the methods compared (Beagle5.4, SCDA, BiU-Net, and STICI). Performance is consistent across ancestry groups and holds up with small sample sizes and weak LD.
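As a rough illustration of the tokenization and 128-SNP windowing mentioned in the summary, the sketch below maps a phased haplotype to token ids and splits it into fixed-length windows. Only the window length of 128 comes from the source; the token ids (`MASK_ID`, `PAD_ID`), the use of `-1` for missing calls, the non-overlapping split, and the function names are assumptions made for this example and may differ from GenoBERT's actual scheme.

```python
import numpy as np

WINDOW = 128             # context length stated in the source
MASK_ID, PAD_ID = 2, 3   # hypothetical token ids, not taken from the paper


def tokenize_haplotype(alleles):
    """Map 0/1 alleles to token ids; negative values (missing) become MASK_ID."""
    alleles = np.asarray(alleles, dtype=np.int64)
    return np.where(alleles < 0, MASK_ID, alleles)


def to_windows(tokens, window=WINDOW):
    """Split a token sequence into non-overlapping windows, padding the tail."""
    n_windows = -(-len(tokens) // window)  # ceiling division
    padded = np.full(n_windows * window, PAD_ID, dtype=np.int64)
    padded[:len(tokens)] = tokens
    return padded.reshape(n_windows, window)


# Toy haplotype: 300 SNPs, every third position in each group of four is missing.
hap = np.array([0, 1, -1, 1] * 75)
windows = to_windows(tokenize_haplotype(hap))
print(windows.shape)  # (3, 128)
```

Each row of `windows` would be one input sequence for a self-attention encoder; overlapping windows are a common alternative that this sketch does not implement.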
📝 Abstract
Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets, the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP), across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy ($r^2 \approx 0.98$) across datasets, and maintains robust performance ($r^2 > 0.90$) even at 50% missingness. Results across ancestries confirm consistent gains in both datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 Kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.
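The evaluation protocol in the abstract (hide a fraction of genotypes, impute them, report $r^2$) can be sketched as below. This is a minimal example under stated assumptions: $r^2$ is taken to be the squared Pearson correlation between true and imputed dosages pooled over all hidden cells, and masking is uniformly random. The abstract does not specify either detail, so the paper may instead compute $r^2$ per variant or use a different masking scheme. The function names are hypothetical.

```python
import numpy as np


def mask_genotypes(dosages, rate, rng):
    """Return a boolean mask marking a random fraction `rate` of cells as hidden."""
    return rng.random(dosages.shape) < rate


def imputation_r2(truth, imputed, mask):
    """Squared Pearson correlation between truth and imputed values on hidden cells."""
    t = truth[mask].astype(float)
    p = imputed[mask].astype(float)
    if t.std() == 0 or p.std() == 0:
        return float("nan")  # correlation is undefined for constant input
    return float(np.corrcoef(t, p)[0, 1] ** 2)


rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=(200, 128))   # toy dosages in {0, 1, 2}
mask = mask_genotypes(truth, 0.25, rng)       # 25% missingness, as in the abstract

# Stand-in for a model: an oracle that returns the truth, so r^2 should be 1.
oracle = truth.copy()
print(imputation_r2(truth, oracle, mask))
```

Scoring only the hidden cells keeps observed genotypes, which any method can copy through unchanged, from inflating the metric.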
Problem

Research questions and friction points this paper is trying to address.

genotype imputation
ancestry bias
rare-variant accuracy
reference-panel dependence
linkage disequilibrium
Innovation

Methods, ideas, or system contributions that make the work stand out.

GenoBERT
reference-free imputation
transformer-based genotype modeling
linkage disequilibrium
self-attention mechanism
Lei Huang
School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, USA
Chuan Qiu
School of Medicine, Tulane University, USA
Biostatistics & Bioinformatics
Kuan-Jui Su
Tulane University, Division of Biomedical Informatics and Genomics
Bioinformatics, machine learning, network analysis, system biology
Anqi Liu
Tulane University
Human Genetics, Computational Biology, Bioinformatics, Deep Learning
Yun Gong
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
Weiqiang Lin
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
Lindong Jiang
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
Chen Zhao
Assistant Professor of Computer Science, Kennesaw State University
deep learning, medical image processing
Meng Song
PhD Student of Computer Science, University of California, San Diego
Reinforcement Learning, Self-supervised Learning, Robot Learning
Jeffrey Deng
Dartmouth College
Qing Tian
University of Alabama at Birmingham
Computer Vision, Machine Learning, Deep Learning, Autonomous Driving
Zhe Luo
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
Ping Gong
Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, USA
Hui Shen
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
Chaoyang Zhang
School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, USA
Hong-Wen Deng
Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA