Priming: Hybrid State Space Models From Pre-trained Transformers

📅 2026-05-08

🤖 AI Summary
This work addresses the limitation that existing large-scale Hybrid State Space Models (Hybrid SSMs) require training from scratch, which hinders architectural exploration and practical deployment. The authors propose Priming, a method that initializes a Hybrid model from a pre-trained Transformer and transfers across Transformer families, model classes (dense or Mixture-of-Experts), and scales. Through short alignment and post-training phases, Priming migrates pre-trained Transformers to Hybrid SSM architectures. The release combines knowledge transfer with Sequence Parallelism for long-context training, optimized GKA kernels, and a vLLM serving plugin. Experiments show that a 32B Hybrid GKA model, trained on less than 0.5% of the source model's pre-training token budget, improves over its source Qwen3-32B by 3.8 average points on long-context reasoning tasks and achieves up to 2.3× higher decode throughput. The authors open-source both the models and the toolchain to support controlled, large-scale comparisons of SSM layer types.
📝 Abstract
Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under the Apache 2.0 License.
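The abstract's claim that Hybrid models carry smaller Key-Value caches can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the layer count, the attention-to-SSM ratio (`attn_every`), the head dimensions, and the fixed SSM state size are all hypothetical values chosen only to show how the cache scales with sequence length for attention layers but stays constant for recurrent layers.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes held by the KV cache: one key and one value tensor per attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Bytes held by recurrent state: fixed size, independent of sequence length."""
    return n_ssm_layers * n_heads * head_dim * state_dim * bytes_per_elem


def hybrid_cache_bytes(n_layers, attn_every, n_kv_heads, n_heads, head_dim,
                       state_dim, seq_len, bytes_per_elem=2):
    """Total decode-time memory when one layer in every `attn_every` is attention.

    `attn_every` is an assumed interleaving pattern, not the paper's recipe.
    """
    n_attn = n_layers // attn_every
    n_ssm = n_layers - n_attn
    attn = kv_cache_bytes(n_attn, n_kv_heads, head_dim, seq_len, bytes_per_elem)
    ssm = ssm_state_bytes(n_ssm, n_heads, head_dim, state_dim, bytes_per_elem)
    return attn + ssm


# Hypothetical configuration (illustrative only, not a real model's settings).
cfg = dict(n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
transformer = kv_cache_bytes(n_attn_layers=64, **cfg)
hybrid = hybrid_cache_bytes(n_layers=64, attn_every=4, n_heads=64,
                            state_dim=128, **cfg)

print(f"Transformer KV cache: {transformer / 2**30:.2f} GiB")
print(f"Hybrid cache + state: {hybrid / 2**30:.2f} GiB")
print(f"Reduction factor:     {transformer / hybrid:.2f}x")
```

Under these assumed numbers the attention layers dominate memory at a 128K context, so replacing most of them with fixed-state layers shrinks the per-sequence footprint roughly in proportion to the fraction of attention layers removed. The actual savings and throughput gains depend on the real architecture and kernels.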
Problem

Research questions and friction points this paper is trying to address.

Hybrid State-Space Models
Architecture Design
Pre-training
Knowledge Transfer
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Priming
Hybrid State-Space Models
Knowledge Transfer
Long-Context Reasoning
Efficient Decoding