🤖 AI Summary
Large language models (LLMs) suffer from quadratic time and memory complexity in multi-head attention (MHA) during inference, along with substantial KV-cache overhead. Method: The paper proposes Apriel-H1, a hybrid architecture that progressively replaces selected MHA layers in a transformer decoder with linear-complexity state space model (SSM) blocks, demonstrating efficient integration of Mamba-style SSMs with MHA at the 15B-parameter scale. The models are obtained through incremental knowledge distillation from a pretrained reasoning transformer, followed by supervised fine-tuning, and are deployed on vLLM for evaluation. Results: Multiple Apriel-H1-15B variants achieve over 2× higher inference throughput with minimal quality degradation, benefiting long-context generation, high-concurrency agentic workloads, and production deployment. The core contribution is empirical validation, at realistic large-model scale, of both the feasibility and the practical utility of hybrid SSM-Transformer architectures.
📝 Abstract
Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at the 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2× higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising reasoning quality.
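The efficiency argument above can be made concrete with a toy sketch (not the paper's implementation, and not real Mamba or MHA kernels): during incremental decoding, an attention layer must cache every past key-value pair, so its per-token work and memory grow with sequence length, while an SSM-style layer carries only a fixed-size recurrent state. All class and variable names below are illustrative assumptions.

```python
# Toy illustration of why replacing MHA layers with SSM blocks cuts
# inference cost: attention accumulates a KV cache that grows with the
# sequence, while a Mamba-style recurrence keeps constant state.

class ToyAttentionLayer:
    """Caches every past (key, value) pair; step t does O(t) work,
    so decoding a length-T sequence costs O(T^2) in total."""
    def __init__(self):
        self.kv_cache = []  # grows by one entry per decoded token

    def step(self, x):
        self.kv_cache.append((x, x))  # toy: key = value = input scalar
        # "Attend" over the entire cache: uniform average as a stand-in
        # for softmax attention, just to show the O(t) scan.
        return sum(v for _, v in self.kv_cache) / len(self.kv_cache)

class ToySSMLayer:
    """Fixed-size recurrent hidden state; O(1) work and memory per token,
    O(T) total -- the linear complexity the abstract refers to."""
    def __init__(self, decay=0.9):
        self.state = 0.0    # constant memory footprint at any length
        self.decay = decay  # toy stand-in for a learned state transition

    def step(self, x):
        # Leaky-integrator recurrence: a minimal linear state update.
        self.state = self.decay * self.state + (1 - self.decay) * x
        return self.state

attn, ssm = ToyAttentionLayer(), ToySSMLayer()
for x in [1.0] * 1000:          # decode 1000 tokens of constant input
    attn.step(x)
    ssm.step(x)

print(len(attn.kv_cache))       # 1000 cached entries after 1000 tokens
print(ssm.state)                # still a single scalar, any length
```

A hybrid decoder in this spirit would interleave both layer types, so total KV-cache memory scales with the number of *remaining* MHA layers rather than with model depth, which is where the throughput gain comes from.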