SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

📅 2025-11-05
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Industrial-scale LLM inference faces severe on-chip memory pressure because KV caches grow linearly with sequence length, which becomes costly at 100k+ token contexts. The pressure is hardest to relieve in static-graph, continuous-batching systems (e.g., vLLM, SGLang), where the standard attention mechanism cannot be readily modified. Method: This work develops SnapStream, a KV cache compression method that builds on the sparse KV attention techniques StreamingLLM and SnapKV and integrates them into a production-grade inference engine without altering model weights or decoding logic. Results: Accuracy is studied on Llama-3.1-8B-Instruct and DeepSeek-R1, and the method is deployed on a 16-way tensor-parallel DeepSeek-671B system. The deployment supports 128K-context inference at up to 1832 tokens/s, improves on-chip memory usage by 4×, and incurs minimal accuracy degradation on LongBench-v2, AIME24, and LiveCodeBench.

📝 Abstract
The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obscuring the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
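
The two techniques the abstract names can be illustrated with a minimal sketch of their cache-retention rules. This is not the paper's SnapStream implementation: the function names, parameters, and toy scores below are hypothetical, and the sketch only shows which token positions each policy would keep. It assumes the commonly described behavior of each method: StreamingLLM keeps a few initial "sink" tokens plus a sliding window of recent tokens, while SnapKV scores earlier prompt positions using attention from a final observation window and keeps the highest-scoring ones.

```python
def streaming_keep(seq_len, num_sink=4, window=8):
    """StreamingLLM-style retention (sketch): first `num_sink` positions
    plus the most recent `window` positions."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))  # cache still fits, keep everything
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))


def snap_keep(scores, obs_window, budget):
    """SnapKV-style retention (sketch): keep the `budget` highest-scoring
    positions before the observation window, plus the window itself.

    `scores[i]` is assumed to be the attention mass that observation-window
    queries place on position i (hypothetical input; entries that fall inside
    the observation window are ignored).
    """
    seq_len = len(scores)
    prefix_len = max(seq_len - obs_window, 0)
    ranked = sorted(range(prefix_len), key=lambda i: scores[i], reverse=True)
    kept_prefix = sorted(ranked[:budget])  # restore positional order
    return kept_prefix + list(range(prefix_len, seq_len))


# Toy inputs, chosen only for illustration.
stream_positions = streaming_keep(seq_len=20, num_sink=2, window=3)
snap_positions = snap_keep([0.1, 0.9, 0.2, 0.8, 0.0, 0.0], obs_window=2, budget=2)
```

Both rules bound the number of cached positions independently of the full sequence length, which is the property that makes them attractive when on-chip memory is the limiting resource.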
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV cache memory usage for large context LLMs
Maintaining model accuracy while compressing attention key-value caches
Enabling efficient sparse KV attention in production inference systems
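
To make the memory pressure concrete, the following back-of-the-envelope sketch estimates the KV cache size for a single sequence. The helper name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an assumption based on the published Llama-3.1-8B configuration rather than a figure taken from this paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed Llama-3.1-8B-like shape (not taken from the paper).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)
compressed = full // 4  # what a 4x smaller cache would occupy

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"128k context: {full / 2**30:.0f} GiB, 4x smaller: {compressed / 2**30:.0f} GiB")
```

Under these assumptions the cache costs 128 KiB per token and grows linearly with context length, reaching 16 GiB for one 128k-token sequence; a 4× reduction would bring that to 4 GiB.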
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression for efficient long sequence decoding
Deployed in static graph systems with continuous batching
Minimal accuracy degradation with improved memory usage
Authors
Jonathan Li
SambaNova Systems, Inc.
Nasim Farahini
SambaNova Systems, Inc.
Evgenii Iuliugin
SambaNova Systems, Inc.
Magnus Vesterlund
SambaNova Systems, Inc.
Christian Haggstrom
SambaNova Systems, Inc.
Guangtao Wang
SambaNova Systems, Inc.
Shubhangi Upasani
SambaNova Systems, Inc.
Ayush Sachdeva
Cartesia AI
Rui Li
SambaNova Systems, Inc.
Faline Fu
SambaNova Systems, Inc.
Chen Wu
SambaNova Systems, Inc.
A. Siddiqua
SambaNova Systems, Inc.
John Long
SambaNova Systems, Inc.
Tuowen Zhao
SambaNova Systems, Inc.
Matheen Musaddiq
SambaNova Systems, Inc.
Hakan Zeffer
SambaNova Systems, Inc.
Yun Du
SambaNova Systems, Inc.
Mingran Wang
SambaNova Systems, Inc.
Qinghua Li
Professor, University of Arkansas
Cybersecurity, Privacy, Artificial Intelligence, Power Grids
Bo Li
SambaNova Systems, Inc.
Urmish Thakker
SambaNova Systems, Inc.
Raghu Prabhakar
SambaNova Systems, Inc.