🤖 AI Summary
Existing genomic applications of language-model embeddings operate predominantly at the gene level and lack fine-grained, variant-level semantic representations. To address this gap, we present one of the first systematic frameworks for constructing a genome-wide, variant-level semantic embedding space covering all 8.9 billion possible single-nucleotide variants (SNVs) in the human genome. Our method builds semantic text descriptions from multi-source functional annotations (FAVOR, ClinVar, and the GWAS Catalog) and embeds them with OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B model, producing embeddings at three scales. Baseline experiments show high predictive accuracy for variant properties, and we outline two downstream applications: embedding-informed hypothesis testing for genome-wide association studies (GWAS) and embedding-augmented genetic risk prediction. All embeddings are publicly released on Hugging Face as a foundation for GWAS, genetic risk prediction, and precision medicine applications.
📝 Abstract
Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus only on gene-level information. We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3+MEGA variants, ~90 million imputed UK Biobank variants, and all ~9 billion possible variants. Embeddings were produced with both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B model. Baseline experiments demonstrate high predictive accuracy for variant properties, validating the embeddings as structured representations of genomic variation. We outline two downstream applications: embedding-informed hypothesis testing, which extends the Frequentist And Bayesian framework to genome-wide association studies (GWAS), and embedding-augmented genetic risk prediction, which enhances standard polygenic risk scores. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
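The step of turning annotations into a semantic text description can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`enumerate_snvs`, `describe_variant`), the annotation field names, the sentence template, and the example values are all hypothetical, chosen only to show the general idea of enumerating every possible SNV (three alternate alleles per reference base) and rendering each one as text that an embedding model could consume.

```python
from typing import Iterator, Mapping, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def enumerate_snvs(chrom: str, start: int, sequence: str) -> Iterator[Variant]:
    """Yield every possible SNV for a reference sequence.

    Each A/C/G/T base has three alternate alleles; other symbols
    (for example N) are skipped. Hypothetical helper for illustration.
    """
    for offset, ref in enumerate(sequence.upper()):
        if ref not in "ACGT":
            continue
        for alt in "ACGT":
            if alt != ref:
                yield chrom, start + offset, ref, alt


def describe_variant(
    variant: Variant, annotations: Mapping[str, Optional[str]]
) -> str:
    """Render one variant and its annotations as a plain-text description.

    The template and field names are assumptions, not the format used
    in the paper. Missing (None or empty) annotations are omitted.
    """
    chrom, pos, ref, alt = variant
    parts = [f"Variant chr{chrom}:{pos} {ref}>{alt}."]
    for key in sorted(annotations):
        value = annotations[key]
        if value is None or value == "":
            continue
        parts.append(f"{key.replace('_', ' ')}: {value}.")
    return " ".join(parts)


# Made-up coordinates and annotation values, for illustration only.
snvs = list(enumerate_snvs("1", 100, "ACN"))
text = describe_variant(
    snvs[0],
    {"gene": "GENE1", "consequence": "missense", "clinical_significance": None},
)
print(len(snvs))
print(text)
```

The resulting string would then be passed to an embedding model such as text-embedding-3-large or Qwen3-Embedding-0.6B; that call is left out here because it depends on model access and on preprocessing details the abstract does not specify.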