Gene42: Long-Range Genomic Foundation Model With Dense Attention

📅 2025-03-20
🤖 AI Summary
This work addresses the challenge of modeling long-range dependencies in genomic sequences by introducing Gene42, described as the first dense-attention genomic foundation model to operate at single-nucleotide resolution over sequences of up to 192,000 bp. Methodologically, it adopts a LLaMA-style decoder-only architecture with dense self-attention, in contrast to state-space models that often rely on convolutional operators. Training is phased: the models are first trained on fixed-length sequences of 4,096 bp and then undergo continued pretraining that extends the context length to 192,000 bp. Key contributions include: (i) a demonstration that dense attention can be scaled to context lengths of this size in genomics; and (ii) reported state-of-the-art performance across several tasks (biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification), together with low perplexity and high reconstruction accuracy. The models are publicly released on Hugging Face.

📝 Abstract
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models use a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continued pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks demonstrate state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
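The abstract reports low perplexity as evidence of modeling quality. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: perplexity is the exponential of the mean negative log-likelihood that a model assigns to the true next tokens. The function name and the example probabilities are illustrative assumptions, not values or code from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens).

    token_probs holds the probability the model assigned to each correct
    next token; the values used below are made up for illustration.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that guesses uniformly over the four nucleotides A, C, G, T.
uniform = perplexity([0.25] * 8)

# A hypothetical model that is usually confident about the next base.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

A uniform guess over four nucleotides gives a perplexity of 4, so any value below 4 indicates that the model has learned some structure in the sequence.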
Problem

Research questions and friction points this paper is trying to address.

Develops Gene42 to handle genomic sequences up to 192,000 bp.
Challenges state-space models with dense attention for genomics.
Achieves state-of-the-art performance in multiple genomic tasks.
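Single-nucleotide resolution means one token per base, so a 192,000 bp window becomes a sequence of 192,000 tokens. The sketch below illustrates that idea only; the vocabulary, the token ids, and the handling of unknown bases are hypothetical and are not taken from the Gene42 tokenizer, which this page does not describe.

```python
# Hypothetical vocabulary: the ids are an assumption for illustration.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
ID_TO_BASE = {i: base for base, i in VOCAB.items()}


def encode(seq):
    """Map each nucleotide to one token id; unknown symbols become N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def decode(ids):
    """Invert encode() for ids that exist in the vocabulary."""
    return "".join(ID_TO_BASE[i] for i in ids)


window = "ACGT" * 48_000      # a synthetic 192,000 bp sequence
tokens = encode(window)       # one token per base pair
```

With this scheme the token count equals the sequence length in bp, which is why the context length and the genomic span are quoted as the same number.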
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense self-attention mechanism for genomics
Extends context length to 192,000 bp
Decoder-only architecture for genomic modeling
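To give a sense of what extending dense attention from 4,096 to 192,000 positions involves, the following back-of-the-envelope sketch counts query-key pairs in causal self-attention, where each position attends to itself and all earlier positions. It counts pairs per head per layer only; actual compute and memory depend on implementation details that are not given here, and the function name is an assumption.

```python
def causal_attention_pairs(n):
    """Number of (query, key) pairs when position i attends to 0..i."""
    return n * (n + 1) // 2


short_ctx = causal_attention_pairs(4_096)     # initial training length
long_ctx = causal_attention_pairs(192_000)    # extended context length

# Context grows about 47x, while the pair count grows roughly 47^2.
growth = long_ctx / short_ctx
```

The quadratic growth in pairs is the reason long-context work often turns to convolutional or state-space alternatives, and it is the cost that a dense-attention model at this length has to absorb.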
Kirill Vishniakov
M42, Abu Dhabi, UAE.
B. Amor
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
Engin Tekin
Cerebras Systems, Sunnyvale, CA, USA.
Nancy A Elnaker
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
Karthik Viswanathan
M42, Abu Dhabi, UAE.
Aleksandr Medvedev
M42, Abu Dhabi, UAE.
Aahan Singh
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
Maryam Nadeem
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
Mohammad Amaan Sayeed
Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
P. Kanithi
M42, Abu Dhabi, UAE.
Tiago Magalhaes
M42, Abu Dhabi, UAE.
Natalia Vassilieva
Sr. Director of Product, Cerebras Systems
image analysis, information retrieval, information extraction, machine learning, natural language processing
Dwarikanath Mahapatra
Khalifa University
AI in Medicine, Medical Image Segmentation, Medical Image Registration, Computer Vision, Deep
Marco Pimentel
Post-doctoral Research Assistant, University of Oxford
Artificial Intelligence, Machine Learning, Signal Processing, Biomedical Engineering
Shadab Khan
M42, Abu Dhabi, UAE.