Multi-megabase scale genome interpretation with genetic language models

📅 2025-01-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of deciphering the molecular mechanisms by which genetic variants drive human disease, working directly from ultra-long DNA sequences of up to 88 Mb. The authors propose Phenformer, the first end-to-end genetic language model designed for whole-genome, multi-megabase-scale analysis. Built on the Transformer architecture, Phenformer integrates multi-scale biological semantics with long-range sequence modeling. It is trained in a multi-task supervised manner on whole-genome sequencing data from more than 150,000 individuals, leveraging combined eQTL and GWAS signals, without requiring additional experimental data. Its key contribution is a purely computational framework, the first capable of generating mechanistic hypotheses across biological scales (cell → tissue → individual). Phenformer's hypotheses about disease-relevant cell and tissue types are better supported by the literature than those from existing methods; it also improves the AUC of disease risk prediction and generalizes better to non-European populations.

📝 Abstract

Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because the consequences of genetic variation manifest across a wide range of cells, tissues and scales, from the molecular to the whole-organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues, directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150,000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match the literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of the molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.

Problem

Research questions and friction points this paper is trying to address.

Genomic Information

Gene Variation

Disease Causality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phenformer

Disease Risk Prediction

Long DNA Sequences

🔎 Similar Papers

Advancing bioinformatics with large language models: components, applications and perspectives