GENERator: A Long-Context Generative Genomic Foundation Model

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) face limitations in genomic sequence analysis due to constrained context length, insufficient parameter scale, and a lack of biological priors, resulting in poor robustness and generalization. To address this, we introduce GENERator, a generative genomic foundation model with 1.2B parameters, trained on 386 billion base pairs of eukaryotic DNA and featuring a 98-kilobase context window. The method enables end-to-end, central-dogma-aligned protein-coding sequence generation, supporting prompt-driven, function-oriented design. We incorporate biologically grounded pretraining objectives, genome-specific tokenization, and optimized positional encoding. The model achieves state-of-the-art performance across multiple genomic benchmarks. Experimental validation confirms its ability to generate protein-coding sequences with native-like structural properties and to controllably design highly active promoters, demonstrating a 3.2-fold increase in transcriptional activity over baselines.
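As a rough illustration of the "genome-specific tokenization" the summary mentions, the sketch below splits DNA into fixed-length k-mers. The choice of non-overlapping 6-mers and the vocabulary construction are assumptions for illustration, not the model's documented tokenizer.

```python
# Minimal sketch of k-mer tokenization for DNA, assuming non-overlapping
# 6-mers; the actual GENERator tokenizer may differ in k, overlap handling,
# and special tokens.
from itertools import product

K = 6
# Vocabulary: all 4^6 = 4096 possible 6-mers over the DNA alphabet.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def tokenize(seq: str) -> list[int]:
    """Map a DNA string to token ids, one per non-overlapping 6-mer.
    Trailing bases that do not fill a full k-mer are dropped in this sketch."""
    seq = seq.upper()
    return [VOCAB[seq[i:i + K]] for i in range(0, len(seq) - K + 1, K)]

print(tokenize("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"))  # 36 bp -> 6 tokens
```

A 6-mer vocabulary keeps sequences six times shorter than base-level tokenization, which is one straightforward way a fixed context window can cover tens of kilobases of DNA.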

📝 Abstract
Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions.
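To make the abstract's "prompt-responsive generation" concrete, here is a hedged sketch of sampling a continuation from a causal genomic LM with Hugging Face transformers and then translating the result under the standard genetic code with Biopython. The checkpoint id is an assumed placeholder for the released weights, and the generation settings are illustrative, not the paper's protocol.

```python
# Hedged sketch: prompt a causal genomic LM for a sequence continuation, then
# translate the result to check it behaves like a protein-coding ORF.
# The checkpoint id below is an assumed placeholder, not a confirmed release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from Bio.Seq import Seq

model_id = "GenerTeam/GENERator-eukaryote-1.2b-base"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "ATGGCCATTGTAATGGGCCGC"  # 5' end of a coding sequence
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
dna = tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", "")

# Trim to a codon multiple and translate until the first stop codon.
protein = Seq(dna[: len(dna) // 3 * 3]).translate(to_stop=True)
print(protein)
```

Downstream, one would fold the translated protein and compare it against known families, which is the kind of structural validation the abstract describes.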
Problem

Research questions and friction points this paper is trying to address.

Predicting and interpreting complex genomic sequences
Overcoming limitations in robustness and application scope of genomic models
Enhancing sequence optimization and precise genomic interventions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context generative model (a positional-encoding sketch follows this list)
1.2B parameters
386B bp training dataset
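Context windows as long as 98k bp hinge on the positional encoding, which the summary flags as "optimized". Below is a minimal NumPy sketch of rotary position embeddings (RoPE), a common choice for long-context decoders; whether GENERator uses this exact variant or base frequency is an assumption here.

```python
# Illustrative RoPE (rotary position embeddings), half-split variant; the base
# frequency and variant are assumptions, not GENERator's documented setup.
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (shape: seq_len x dim, dim even) by
    position-dependent angles, encoding position without learned tables."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)               # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(128, 64)  # toy query block: 128 positions, head dim 64
print(rope(q).shape)          # (128, 64)
```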
Wei Wu
Apsara Lab, Alibaba Cloud, Beijing, China
Qiuyi Li
Zhongguancun Academy & Zhongguancun Institute of Artificial Intelligence
Genomics · Foundation model · Large language model · Machine learning
Mingyang Li
Apsara Lab, Alibaba Cloud, Beijing, China
Kun Fu
Apsara Lab, Alibaba Cloud, Beijing, China
Fuli Feng
University of Science and Technology of China, Hefei, China
Jieping Ye
Apsara Lab, Alibaba Cloud, Beijing, China
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics · atomic molecular physics · free electron laser
Zheng Wang
Apsara Lab, Alibaba Cloud, Beijing, China