Multi-megabase scale genome interpretation with genetic language models

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of deciphering the molecular mechanisms by which genetic variants in ultra-long DNA sequences (up to 88 Mb) drive human disease. To this end, the authors propose Phenformer, the first end-to-end genetic language model designed for whole-genome, multi-megabase-scale analysis. Built upon the Transformer architecture, Phenformer integrates multi-scale biological semantics with long-range sequence modeling and is trained in a multi-task supervised manner solely on >150,000 whole-genome sequencing samples, leveraging combined eQTL and GWAS signals, without requiring experimental validation data. Its key contribution is the first purely computational framework capable of generating mechanistic hypotheses across biological scales (cell → tissue → individual). Experiments show that Phenformer's hypotheses about disease-relevant cell and tissue types match the literature better than those from existing state-of-the-art methods; moreover, disease risk predictors enriched by Phenformer show improved prediction performance and better generalisation to diverse populations.

📝 Abstract
Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because the consequences of genetic variation manifest across a wide range of cells, tissues and scales, spanning from the molecular to the whole-organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150,000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match the literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase-scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of the molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.
Problem

Research questions and friction points this paper is trying to address.

Genomic Information
Gene Variation
Disease Causality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phenformer
Disease Risk Prediction
Long DNA Sequences
Frederik Träuble
GSK plc, Zug, Switzerland
Lachlan Stuart
GSK plc, Zug, Switzerland
Andreas Georgiou
GSK plc, Zug, Switzerland
Pascal Notin
Harvard University
Artificial Intelligence, Generative models, Computational biology, Protein design
Arash Mehrjou
ETH Zürich - Max Planck Institute - GSK.ai
Machine Learning, Control Theory, Causality
Ron Schwessinger
GSK plc, Zug, Switzerland
Mathieu Chevalley
GSK plc, Zug, Switzerland; ETH Zurich, Switzerland
Kim Branson
GSK plc, Zug, Switzerland
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems & ELLIS Institute, Tübingen, Germany
Cornelia van Duijn
Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
Debora Marks
Systems Biology, Harvard Medical School, Broad Institute of Harvard and MIT
machine learning, genomics, drug design, statistical inference, protein folding & design
Patrick Schwab
GSK
Causal Machine Learning, AI in Drug Discovery, AI in Healthcare, AI in Medicine