🤖 AI Summary
This work addresses the limitations of conventional reference-panel-based genotype imputation methods, which suffer from ancestry bias and reduced accuracy for rare variants. The authors propose GenoBERT, a Transformer-based approach that does not rely on external reference panels. Using a genotype tokenization scheme, a 128-SNP context window, and bidirectional self-attention, GenoBERT captures both short- and long-range linkage disequilibrium (LD) patterns. Evaluated on the Louisiana Osteoporosis Study (LOS) and 1000 Genomes Project (1KGP) datasets, GenoBERT maintains imputation accuracy (r²) above 0.90 even at 50% missingness and reaches r² ≈ 0.98 at practical missingness levels (up to 25%), achieving the highest overall accuracy among the four baselines compared (Beagle5.4, SCDA, BiU-Net, and STICI). It also performs consistently across ancestry groups, including those with small sample sizes and weak LD.
📝 Abstract
Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets, the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP), across ancestry groups and multiple genotype missingness levels (5–50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy ($r^2 \approx 0.98$) across datasets, and it maintains robust performance ($r^2 > 0.90$) even at 50% missingness. Results stratified by ancestry confirm consistent gains in both datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.
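
The abstract does not spell out the token vocabulary, the masking procedure, or how $r^2$ is computed, so the following is only a minimal sketch of what such an evaluation setup might look like, not the authors' implementation. It assumes each phased haplotype is a sequence of 0/1 alleles, that a hypothetical `MASK_TOKEN` marks hidden positions, that windows are non-overlapping, and that $r^2$ is the squared Pearson correlation between true and imputed allele values. All function names are illustrative.

```python
import math
import random

MASK_TOKEN = 2  # hypothetical token id for a hidden allele (0 = ref, 1 = alt)
WINDOW = 128    # SNPs per context window, as stated in the abstract


def to_windows(haplotype, window=WINDOW):
    """Split one haplotype (list of 0/1 alleles) into non-overlapping windows.

    Trailing SNPs that do not fill a whole window are dropped for simplicity
    (an assumption; the paper may pad or overlap windows instead).
    """
    n_full = len(haplotype) // window
    return [haplotype[i * window:(i + 1) * window] for i in range(n_full)]


def mask_tokens(tokens, missing_rate, rng):
    """Hide each token independently with probability `missing_rate`.

    Returns the masked sequence and a parallel list of booleans marking
    which positions were hidden.
    """
    hidden = [rng.random() < missing_rate for _ in tokens]
    masked = [MASK_TOKEN if h else t for t, h in zip(tokens, hidden)]
    return masked, hidden


def imputation_r2(truth, predicted):
    """Squared Pearson correlation between true and imputed values.

    Returns NaN when either input has zero variance, since the
    correlation is undefined in that case.
    """
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predicted))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_t == 0 or var_p == 0:
        return math.nan
    return cov * cov / (var_t * var_p)


# Toy usage: simulate one 300-SNP haplotype and hide 25% of the first window.
rng = random.Random(0)
haplotype = [rng.randint(0, 1) for _ in range(300)]
windows = to_windows(haplotype)
masked, hidden = mask_tokens(windows[0], 0.25, rng)
```

In a real benchmark the hidden positions would be filled in by the model, and `imputation_r2` would then be evaluated on those positions only, comparing the imputed values against the held-out truth at each missingness level.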