🤖 AI Summary
This work addresses the challenge of modeling long-range dependencies in genomic sequences by introducing Gene42, described as the first dense-attention genomic foundation model to operate at single-nucleotide resolution over sequences of up to 192 kbp. Methodologically, it adopts a LLaMA-style decoder-only architecture with dense self-attention, rather than the convolutional or state-space modules used by other long-context genomic models, and extends the context length in stages through continued pretraining, starting from 4,096 bp. Key contributions include: (i) empirical evidence that dense attention scales to hundred-kilobase genomic sequences; and (ii) state-of-the-art performance reported across diverse tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification, together with low perplexity and high sequence reconstruction accuracy. The models are publicly released on Hugging Face.
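To give a sense of why extending dense attention to this length is demanding, the sketch below compares the size of a single attention score matrix at the abstract's starting length of 4,096 bp and at the final 192,000 bp. It assumes one token per nucleotide; the function name is hypothetical, and the intermediate stage lengths are not stated in the source, so only the two endpoints are used.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one dense (seq_len x seq_len) attention score matrix,
    counted per head and per layer."""
    return seq_len * seq_len


# Endpoints taken from the abstract; one token per base pair is an assumption.
start_len = 4_096
final_len = 192_000

length_ratio = final_len / start_len
entry_ratio = attention_score_entries(final_len) / attention_score_entries(start_len)

print(f"context grows {length_ratio:.1f}x, attention entries grow {entry_ratio:.0f}x")
```

Fused attention kernels avoid materializing this matrix in memory, but the amount of computation still grows with the number of entries, which is one reason a staged extension from short to long contexts is a natural choice.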
📝 Abstract
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to handle context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models use a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continued pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense-attention model to handle context lengths this long in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks demonstrate state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
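For readers unfamiliar with the perplexity metric mentioned above, here is a minimal sketch of how it is computed for a next-nucleotide model. It assumes a four-letter vocabulary (A, C, G, T); the actual Gene42 vocabulary may include additional tokens, and the probabilities below are made-up illustrations rather than model outputs.

```python
import math


def perplexity(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood), where each entry is the
    probability the model assigned to the nucleotide that actually came next."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)


# A model that guesses uniformly over A, C, G, T assigns 0.25 to every true base.
uniform = perplexity([0.25] * 8)

# A hypothetical model that assigns 0.5 to every true base is "half as confused".
confident = perplexity([0.5] * 8)

print(uniform, confident)
```

Under this assumption, a perplexity of 4 corresponds to uniform guessing over the four bases, so values below 4 indicate that the model has learned some structure in the sequence.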