Incorporating LLM Embeddings for Variation Across the Human Genome

📅 2025-09-24

🤖 AI Summary

Existing genomic studies that use LLM embeddings operate predominantly at the gene level and lack fine-grained, variant-level semantic representations. To address this gap, we present one of the first systematic constructions of a genome-wide, variant-level semantic embedding space, covering all 8.9 billion possible single-nucleotide variants (SNVs) in the human genome. Our method builds semantic text descriptions from multi-source functional annotations (FAVOR, ClinVar, and the GWAS Catalog) and embeds them with OpenAI's text-embedding-3-large and Qwen3-Embedding-0.6B at three scales. Baseline experiments show high predictive accuracy for variant properties, and we outline two downstream applications: embedding-informed hypothesis testing for genome-wide association studies (GWAS) and embedding-augmented polygenic risk scores. All embeddings are publicly released on Hugging Face as a foundational representation for GWAS and precision medicine applications.

📝 Abstract

Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus only on gene-level information. We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3+MEGA variants, ~90 million imputed UK Biobank variants, and all ~9 billion possible variants. Embeddings were produced with both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models. Baseline experiments demonstrate high predictive accuracy for variant properties, validating the embeddings as structured representations of genomic variation. We outline two downstream applications: embedding-informed hypothesis testing, which extends the Frequentist And Bayesian framework to genome-wide association studies, and embedding-augmented genetic risk prediction, which enhances standard polygenic risk scores. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
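
The step of turning annotations into a semantic text description can be illustrated with a minimal sketch. The abstract does not give the actual template or annotation schema, so the function name `describe_variant`, the field names (`gene`, `consequence`, `clinical_significance`), and the sentence layout below are all hypothetical. The resulting string is what would then be passed to an embedding model such as text-embedding-3-large or Qwen3-Embedding-0.6B.

```python
from typing import Mapping, Optional


def describe_variant(
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
    annotations: Mapping[str, Optional[str]],
) -> str:
    """Render one SNV and its annotations as a short text description.

    The field names and wording are illustrative assumptions, not the
    template used in the paper.
    """
    parts = [f"Variant chr{chrom}:{pos} {ref}>{alt}."]
    # Sort keys so the same annotations always yield the same text.
    for key in sorted(annotations):
        value = annotations[key]
        if value is None or value == "":
            continue  # skip annotations that are missing for this variant
        parts.append(f"{key.replace('_', ' ')}: {value}.")
    return " ".join(parts)


# Hypothetical annotation record for a single variant.
example = describe_variant(
    "1",
    12345,
    "A",
    "G",
    {"gene": "GENE1", "clinical_significance": None, "consequence": "missense"},
)
print(example)
```

Sorting the keys and skipping empty fields keeps the description deterministic, which matters if the same text is embedded by more than one model.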

Problem

Research questions and friction points this paper is trying to address.

Generating variant-level embeddings across the entire human genome

Creating semantic descriptions for 8.9 billion genomic variants

Developing embedding-based methods for genetic risk prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating variant-level embeddings across the human genome

Using semantic text descriptions for 8.9 billion variants

Applying embeddings to association studies and risk prediction
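
To make the idea of embedding-augmented risk prediction concrete, here is a small sketch. A standard polygenic risk score is the sum of per-variant effect sizes times allele dosages. The abstract does not say how embeddings enter the score, so the logistic reweighting below, the parameter vector `theta`, and all function names are assumptions chosen for illustration only, not the authors' method.

```python
import math
from typing import List, Sequence


def standard_prs(betas: Sequence[float], dosages: Sequence[float]) -> float:
    """Standard polygenic risk score: sum of effect size times allele dosage."""
    return sum(b * g for b, g in zip(betas, dosages))


def embedding_weights(
    embeddings: Sequence[Sequence[float]], theta: Sequence[float]
) -> List[float]:
    """Map each variant embedding to a multiplier in (0, 2).

    Assumption: a logistic function of the dot product theta . e_j,
    scaled so that theta = 0 gives a neutral multiplier of 1.
    """
    weights = []
    for e in embeddings:
        score = sum(t * x for t, x in zip(theta, e))
        weights.append(2.0 / (1.0 + math.exp(-score)))
    return weights


def augmented_prs(
    betas: Sequence[float],
    dosages: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    theta: Sequence[float],
) -> float:
    """PRS with each variant's contribution rescaled by its embedding weight."""
    weights = embedding_weights(embeddings, theta)
    return sum(w * b * g for w, b, g in zip(weights, betas, dosages))


# Toy data: three variants with two-dimensional embeddings.
betas = [0.5, -0.25, 0.125]
dosages = [2, 1, 0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

baseline = standard_prs(betas, dosages)
neutral = augmented_prs(betas, dosages, embeddings, theta=[0.0, 0.0])
shifted = augmented_prs(betas, dosages, embeddings, theta=[1.0, -1.0])
```

With `theta` set to zero every multiplier is 1, so the augmented score reduces to the standard one. In practice `theta` would be fit on training data, which this sketch leaves out.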
