🤖 AI Summary
This study addresses the challenge of mRNA vaccine sequence optimization by introducing the first unified modeling framework for the entire mRNA sequence, covering the 5′UTR, coding sequence (CDS), and 3′UTR, to jointly improve prediction of translation efficiency, transcript stability, and degradation. We propose a novel architecture that integrates Structured State Space Models (SSMs) with multi-head attention, coupled with a joint nucleotide-codon tokenization scheme and a bioinformatics-guided two-stage pretraining paradigm. Our model supports end-to-end modeling of sequences of 6,000+ nucleotides, six times longer than current state-of-the-art models can handle, while using only 10% of their parameter count. It outperforms existing methods across multitask UTR and CDS prediction benchmarks. The code and pretrained models are publicly released.
📝 Abstract
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence and the untranslated regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for these properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. To address these challenges, we present Helix-mRNA, a hybrid model that combines structured state-space modelling with attention. After an initial pre-training stage, a second pre-training stage allows us to specialise the model on high-quality data. We employ single-nucleotide tokenization of mRNA sequences with codon separation, ensuring that prior biological and structural information from the original mRNA sequence is not lost. Helix-mRNA outperforms existing methods in analysing the properties of both UTRs and coding regions. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models, and its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
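To make the idea of single-nucleotide tokenization with codon separation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the tokenizer shipped with Helix-mRNA: the function name `tokenize_mrna`, the separator symbol `|`, the choice to place a separator after every codon, and the choice to leave the UTRs without separators are all hypothetical.

```python
from typing import List

# Hypothetical separator symbol; the token actually used by Helix-mRNA
# is not specified in the abstract.
CODON_SEP = "|"
VALID_NUCLEOTIDES = set("ACGU")


def tokenize_mrna(five_utr: str, cds: str, three_utr: str,
                  sep: str = CODON_SEP) -> List[str]:
    """Split an mRNA into single-nucleotide tokens, marking codon boundaries.

    Assumption: UTR nucleotides are emitted as-is, and a separator token
    follows each complete codon of the coding sequence.
    """
    # Normalise case and accept DNA-style input by mapping T to U.
    regions = [s.upper().replace("T", "U") for s in (five_utr, cds, three_utr)]
    for region in regions:
        unknown = set(region) - VALID_NUCLEOTIDES
        if unknown:
            raise ValueError(f"unexpected symbols: {sorted(unknown)}")

    utr5, coding, utr3 = regions
    if len(coding) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")

    tokens: List[str] = list(utr5)
    for start in range(0, len(coding), 3):
        tokens.extend(coding[start:start + 3])  # three single-nucleotide tokens
        tokens.append(sep)                      # codon boundary marker
    tokens.extend(utr3)
    return tokens


if __name__ == "__main__":
    print(tokenize_mrna("GG", "AUGGCC", "A"))
    # ['G', 'G', 'A', 'U', 'G', '|', 'G', 'C', 'C', '|', 'A']
```

Dropping the separator tokens recovers the original nucleotide string exactly, which is the sense in which this kind of scheme keeps nucleotide-level information while still exposing the codon structure of the coding region.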