Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
DNA large language models face dual bottlenecks in long-sequence modeling: the quadratic computational complexity of self-attention and the linear growth of the key-value (KV) cache. Existing window-based truncation methods compromise the fidelity of long-range dependencies. To address this, we propose FOCUS, a novel k-mer-granularity hierarchical context-compression module. FOCUS introduces shared boundary windows, summary-token insertion, multi-layer progressive compression, and randomized scheduling to enable near-lossless cross-window propagation of long-range information. Evaluated on the Evo-2 base model, FOCUS compresses 1 kb of context into just 10 summary tokens, reducing the KV cache by ~100× and achieving near-linear inference-latency scaling. Crucially, the per-nucleotide probability deviation remains negligible (~0.0004), preserving modeling accuracy. This significantly improves both the efficiency and the capability of ultra-long DNA sequence modeling.
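The ~100× KV-cache reduction claimed above follows directly from the stated ratio of window length to retained summary tokens. A back-of-envelope check, with an illustrative 100 kb sequence length that is an assumption of this sketch (only the 1 kb window and 10 summary tokens come from the paper):

```python
# Back-of-envelope check of the claimed KV-cache savings: each 1 kb window
# is compressed to 10 summary tokens, so only ~1/100 of per-window KV
# entries survive. The 100 kb sequence length is an illustrative assumption.

def kv_entries(num_windows: int, kept_per_window: int) -> int:
    """KV entries retained after processing `num_windows` windows."""
    return num_windows * kept_per_window

seq_len = 100_000      # 100 kb sequence (assumed for illustration)
window_len = 1_000     # 1 kb windows, as in the paper
summaries = 10         # summary tokens kept per window, as in the paper

windows = seq_len // window_len
full_cache = seq_len                          # uncompressed: one KV entry per token
focus_cache = kv_entries(windows, summaries)  # compressed: summary KV only

print(full_cache // focus_cache)  # → 100
```

The ratio is independent of total sequence length, since both caches grow linearly in the number of windows.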

📝 Abstract
Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
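The abstract's windowed scheme — process each window with previously retained summary KV prepended, then discard ordinary-token KV — can be sketched as below. This is a minimal, hypothetical outline: `encode_window` stands in for the Transformer forward pass, the KV states are dummy placeholders, and the `boundary` overlap parameter is an assumption about how the shared-boundary windows are formed.

```python
# Minimal, hypothetical sketch of FOCUS-style windowed inference:
# only summary-token KV states are carried across windows; the KV of
# ordinary tokens is discarded after each window. `encode_window` is a
# placeholder (an assumption), not the paper's actual model.

from typing import List, Tuple

KV = Tuple[list, list]  # (key, value) activations; real shapes elided


def encode_window(tokens: list, context_kv: List[KV], n_summary: int) -> List[KV]:
    """Placeholder for running the model over one window, attending to the
    retained summary KV in `context_kv`; returns KV states for the
    n_summary inserted summary tokens only (dummy values here)."""
    return [([0.0], [0.0]) for _ in range(n_summary)]


def focus_inference(sequence: list, window_len: int = 1000,
                    boundary: int = 0, n_summary: int = 10) -> List[KV]:
    retained: List[KV] = []           # summary KV carried across windows
    step = window_len - boundary      # shared-boundary windows may overlap
    for start in range(0, len(sequence), step):
        window = sequence[start:start + window_len]
        summary_kv = encode_window(window, retained, n_summary)
        retained.extend(summary_kv)   # ordinary-token KV is discarded here
    return retained
```

Because `retained` grows by a constant 10 entries per 1 kb window instead of 1,000, attention cost per window stays bounded and total inference scales near-linearly in sequence length, matching the claimed O(N) behavior.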
Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic computational cost of DNA LLM self-attention
Compressing KV-cache memory for autoregressive DNA sequence decoding
Enabling long-context genomic inference without fidelity loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive context-compression module for DNA LLMs
Combines k-mer representation with hierarchical compression
Reduces KV-cache memory and enables near-linear scaling
Rui Zhu
Yale School of Medicine, New Haven, USA
Xiaopu Zhou
The Hospital for Sick Children, Toronto, Canada
Haixu Tang
Professor, Indiana University
computational biology
Stephen W. Scherer
The Hospital for Sick Children, Toronto, Canada
Lucila Ohno-Machado
University of California San Diego
Biomedical Informatics, Predictive Modeling