Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the high compute and memory cost of long-context inference, where the full prompt is processed during prefill, cached at every layer, and repeatedly attended to during decoding. The authors propose SPEED (Shallow Prefill, dEEp Decode), a layer-asymmetric KV-visibility policy: KV states for prompt tokens are materialized only in the lower layers, while the upper layers drop prompt tokens from the decode-time visibility set entirely, keeping only a minimal Beginning-of-Sequence (BoS) anchor. Decode-phase tokens remain full-depth. Combined with selective KV instantiation and instruction fine-tuning on Llama-3.1-8B, SPEED processes prompt tokens with only 75% of the model's layers. It reaches an OLMES-style average score of 51.2 (vs. 51.4 for the full-depth baseline) and, at 128K context, reduces time-to-first-token (TTFT) by 33%, time-per-output-token (TPOT) by 22%, and active KV memory by 25.0%.
πŸ“ Abstract
Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
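The visibility policy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `SpeedPolicy`, `prefill_layers`, `num_anchors`, `visible_positions`, `visible_count`, and `active_kv_fraction` are hypothetical, the single BoS anchor is an assumed reading of "minimal BoS anchor", and the 24-of-32 layer cutoff is simply 75% of Llama-3.1-8B's 32 decoder layers. The sketch only counts which KV positions a Decode-phase query could attend to at each layer, which is enough to see where a KV-memory saving of roughly 25% at long context could come from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeedPolicy:
    """Hypothetical description of a layer-asymmetric KV-visibility policy."""
    num_layers: int       # total decoder layers (32 for Llama-3.1-8B)
    prefill_layers: int   # lower layers in which prompt tokens keep KV states
    num_anchors: int = 1  # leading BoS anchor tokens kept at every layer (assumed)


def visible_positions(policy, layer, prompt_len, num_decoded):
    """KV positions a Decode-phase query may attend to at `layer` (0-indexed)."""
    total = prompt_len + num_decoded
    if layer < policy.prefill_layers:
        # Lower layers: the whole prompt and all decoded tokens are visible.
        return list(range(total))
    # Upper layers: only the anchors and the Decode-phase tokens are visible.
    anchors = list(range(min(policy.num_anchors, prompt_len)))
    return anchors + list(range(prompt_len, total))


def visible_count(policy, layer, prompt_len, num_decoded):
    """Same as len(visible_positions(...)), without building the list."""
    if layer < policy.prefill_layers:
        return prompt_len + num_decoded
    return min(policy.num_anchors, prompt_len) + num_decoded


def active_kv_fraction(policy, prompt_len, num_decoded):
    """Fraction of full-depth KV entries that stay active under the policy."""
    full = policy.num_layers * (prompt_len + num_decoded)
    kept = sum(
        visible_count(policy, layer, prompt_len, num_decoded)
        for layer in range(policy.num_layers)
    )
    return kept / full


if __name__ == "__main__":
    policy = SpeedPolicy(num_layers=32, prefill_layers=24, num_anchors=1)
    # Toy case: 5 prompt tokens (position 0 is BoS) and 2 decoded tokens.
    print(visible_positions(policy, layer=0, prompt_len=5, num_decoded=2))
    print(visible_positions(policy, layer=24, prompt_len=5, num_decoded=2))
    # Long-context case: 128K prompt tokens, before any token is decoded.
    saving = 1 - active_kv_fraction(policy, prompt_len=128 * 1024, num_decoded=0)
    print(f"active KV reduction: {saving:.3f}")
```

In the toy case, layer 0 sees positions 0 through 6, while layer 24 sees only the BoS anchor at position 0 and the two decoded tokens at positions 5 and 6. Under these assumptions the KV count shrinks in proportion to the fraction of layers that drop the prompt, so the saving approaches 25% as the prompt grows; how the paper measures active KV memory in practice may differ.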
Problem

Research questions and friction points this paper is trying to address.

long-context inference
KV cache
decoder-only language models
prefill
autoregressive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

layer-asymmetric KV visibility
long-context inference
KV cache optimization
Shallow Prefill Deep Decode
efficient decoding
Jungsuk Oh
Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Hyeseo Jeon
Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Hyunjune Ji
Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Kyongmin Kong
Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Jay-Yoon Lee
Seoul National University
Machine Learning, Artificial Intelligence, Knowledge Injection, Structured prediction