Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

📅 2024-06-11
🏛️ arXiv.org
📈 Citations: 31
Influential: 5
📄 PDF
🤖 AI Summary
To address the quadratic computational complexity and limited length extrapolation of long-context modeling, this paper proposes Samba, a hybrid architecture that combines selective state space models (SSMs) with sliding window attention (SWA) layer by layer. The SSM layers compress the sequence into recurrent hidden states for linear-time processing, while the SWA layers precisely recall recent context. Scaled to 3.8B parameters on 3.2T training tokens, Samba generalizes zero-shot to 1M-token contexts with improved perplexity despite being pretrained on 4K-length sequences; after finetuning on 4K-length sequences, it achieves perfect recall on the Passkey Retrieval task at 256K context and extrapolates better than full-attention baselines on the harder Phonebook retrieval task. As a linear-time model, it also sustains 3.73x higher throughput than grouped-query-attention Transformers on 128K-token prompts. The results indicate that a simple, principled integration of SSMs and local attention can jointly deliver inference efficiency, modeling capacity, and length generalization at extreme context lengths.

📝 Abstract
Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity in context lengths of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves a 3.73x higher throughput compared to Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open source data is publicly available at https://github.com/microsoft/Samba.
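The linear-time property claimed in the abstract comes from the sliding window: each token attends to at most a fixed-size window of recent keys, so attention cost grows linearly with sequence length rather than quadratically. A minimal sketch of this cost accounting (the function names and causal-window convention are illustrative assumptions, not the paper's implementation):

```python
def swa_attended_positions(seq_len: int, window: int) -> list[list[int]]:
    """For each query position i, a causal sliding-window attention layer
    attends only to key positions in [max(0, i - window + 1), i]."""
    return [list(range(max(0, i - window + 1), i + 1)) for i in range(seq_len)]

def swa_cost(seq_len: int, window: int) -> int:
    """Total number of query-key interactions for one SWA layer.

    Bounded by seq_len * window: linear in sequence length for a fixed
    window, versus seq_len**2 for full attention.
    """
    return sum(len(keys) for keys in swa_attended_positions(seq_len, window))
```

For example, `swa_cost(4096, 128)` is at most 4096 * 128 interactions, while full attention over the same sequence would require roughly 4096 * 4096; the gap widens as context grows, which is why the hybrid extrapolates cheaply to very long prompts.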
Problem

Research questions and friction points this paper is trying to address.

Efficiently model sequences with infinite context length.
Overcome quadratic complexity and limited length generalization.
Achieve superior performance and memory recall in long contexts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-SWA architecture for efficient modeling.
Linear-time sequence model with high throughput.
Scalable to 3.8B parameters, 3.2T training tokens.
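The layer-wise Mamba-SWA hybrid named above can be sketched as a simple operator schedule. The specific block order used here (Mamba, MLP, SWA, MLP, repeated) is an assumption for illustration, not necessarily the paper's verbatim configuration:

```python
def samba_layer_schedule(n_blocks: int) -> list[str]:
    """Build a per-layer operator schedule for a hypothetical Samba-style
    stack: each block interleaves a Mamba (selective SSM) layer and a
    sliding-window attention (SWA) layer, each followed by an MLP."""
    schedule: list[str] = []
    for _ in range(n_blocks):
        schedule += ["Mamba", "MLP", "SWA", "MLP"]
    return schedule
```

Under this schedule the SSM layers carry compressed long-range state across the whole sequence, while the interleaved SWA layers handle precise recall within the recent window, matching the division of labor the abstract describes.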