ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the GPU memory bottleneck caused by the linear growth of the Latent-Cache in DeepSeek-V3.2-Exp during long-context decoding, which severely limits batch size and throughput, this paper proposes an offload-centric heterogeneous cache management architecture. We introduce the first CPU-GPU collaborative dynamic offloading mechanism tailored for sparse attention models, decoupling batch size from GPU memory constraints. Integrated with a low-overhead cache residency policy and PD-decoupled optimization, our approach enables selective migration and efficient reuse of the Latent-Cache. Experimental results demonstrate a 69.4% improvement in decoding throughput at 32K context length and a 123% improvement at 128K, significantly reducing deployment costs for large language models serving long-context workloads.
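
The page does not include ESS's implementation, so the following is a minimal sketch, assuming a PyTorch-style serving stack, of what an offload-centric Latent-Cache pool with a simple residency policy could look like. The class name LatentCachePool, its methods, the GPU budget, and the evict-the-longest heuristic are illustrative assumptions, not the authors' design.

```python
import torch

class LatentCachePool:
    """Sketch: per-request Latent-Cache entries live either on GPU or in pinned CPU memory."""

    def __init__(self, latent_dim=576, gpu_budget_bytes=8 << 30, dtype=torch.bfloat16):
        self.latent_dim = latent_dim            # per-token latent width (assumed)
        self.gpu_budget = gpu_budget_bytes      # GPU bytes reserved for resident caches (assumed)
        self.dtype = dtype
        self.gpu_cache = {}                     # req_id -> [T, latent_dim] CUDA tensor
        self.cpu_cache = {}                     # req_id -> [T, latent_dim] pinned CPU tensor
        self.copy_stream = torch.cuda.Stream()  # side stream so H2D prefetch can overlap compute

    def _gpu_bytes(self):
        return sum(t.numel() * t.element_size() for t in self.gpu_cache.values())

    def offload(self, req_id):
        """Move one request's latent cache to pinned host memory, freeing GPU capacity."""
        dev = self.gpu_cache.pop(req_id)
        host = torch.empty(dev.shape, dtype=dev.dtype, device="cpu", pin_memory=True)
        host.copy_(dev)                         # blocking D2H here; a real system would overlap it
        self.cpu_cache[req_id] = host

    def ensure_resident(self, req_id):
        """Bring an offloaded cache back to GPU before its attention step, then rebalance."""
        if req_id not in self.gpu_cache:
            host = self.cpu_cache.pop(req_id)
            with torch.cuda.stream(self.copy_stream):
                dev = host.to("cuda", non_blocking=True)
            torch.cuda.current_stream().wait_stream(self.copy_stream)
            self.gpu_cache[req_id] = dev
        # Toy residency policy: once over budget, evict the longest other cache first.
        while self._gpu_bytes() > self.gpu_budget and len(self.gpu_cache) > 1:
            victim = max((r for r in self.gpu_cache if r != req_id),
                         key=lambda r: self.gpu_cache[r].shape[0], default=None)
            if victim is None:
                break
            self.offload(victim)
        return self.gpu_cache[req_id]
```

A production design would batch copies, prefetch a step ahead, and coordinate with the scheduler, but the key point matches the summary: GPU memory holds only what the next decode step needs, while the bulk of the Latent-Cache lives in CPU memory.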

📝 Abstract
DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved substantially, the Decode stage of PD disaggregation remains a major bottleneck. This bottleneck stems primarily from the conflict between the linear growth of the Latent-Cache with sequence length and the limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while preserving latency-critical components on GPU. By freeing GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby reducing deployment costs in real-world settings. Our high-fidelity simulations show that ESS delivers a 69.4% throughput improvement at 32K context length and up to a 123% throughput improvement at 128K, demonstrating its effectiveness for large-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
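
A back-of-envelope calculation makes the batch-size constraint concrete. The dimensions below follow the public DeepSeek-V3 configuration (61 layers, a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key per token per layer); the cache dtype and the free-memory figure are assumptions for illustration, and the actual V3.2-Exp deployment may differ.

```python
# Back-of-envelope sketch of why the Latent-Cache caps the decode batch size.
LAYERS = 61
LATENT_DIM = 512 + 64          # compressed KV latent + decoupled RoPE key, per token per layer
BYTES_PER_ELEM = 2             # BF16 assumed; an FP8 cache would halve this
FREE_GPU_BYTES = 40 * 1024**3  # assumed memory left for cache after weights and activations

def cache_bytes_per_request(context_len):
    return context_len * LAYERS * LATENT_DIM * BYTES_PER_ELEM

for ctx in (32 * 1024, 128 * 1024):
    per_req = cache_bytes_per_request(ctx)
    print(f"{ctx:>7} tokens: {per_req / 1024**3:5.2f} GiB/request, "
          f"max resident batch ~ {FREE_GPU_BYTES // per_req}")
# With the cache offloaded to CPU memory, the resident batch is no longer bound
# by FREE_GPU_BYTES, which is the decoupling ESS targets.
```

Under these assumptions a 128K-context request needs several GiB of Latent-Cache on its own, so only a handful of requests fit per GPU; moving the cache to host memory is what lets the Decode-stage batch keep growing.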
Problem

Research questions and friction points this paper is trying to address.

Addresses GPU memory bottleneck in DeepSeek-V3.2-Exp's decode stage
Decouples batch-size scaling from GPU memory constraints via offloading
Improves throughput for large-context inference workloads cost-effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offloads Latent-Cache to CPU memory to free GPU capacity
Decouples batch-size scaling from GPU memory constraints
Selectively preserves latency-critical components on GPU (see the sketch after this list)
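
The page does not describe the migration mechanism itself. The sketch below is one hypothetical way selective migration could exploit the sparse-attention property that each decode step touches only a small top-k set of tokens: keep a recent window resident on GPU and gather only the selected older latent rows from host memory. The function name, memory layout, and recent-window split are assumptions, not ESS's confirmed design.

```python
import torch

def gather_selected_latents(cpu_latents, topk_idx, gpu_recent, recent_start):
    """
    cpu_latents  : CPU tensor [T_old, D] holding offloaded (older) latent rows
    topk_idx     : CUDA LongTensor [k] of token positions picked by the sparse indexer
    gpu_recent   : CUDA tensor [W, D], latency-critical recent window kept resident
    recent_start : absolute position of gpu_recent[0]
    Returns a [k, D] CUDA tensor with the latent rows the attention kernel needs.
    """
    on_gpu = topk_idx >= recent_start
    # Rows already resident: index directly on the GPU.
    gpu_rows = gpu_recent[topk_idx[on_gpu] - recent_start]
    # Rows on the host: gather only the selected few, then issue one H2D copy
    # (a real system would stage through pinned memory for a truly async copy).
    host_idx = topk_idx[~on_gpu].cpu()
    cpu_rows = cpu_latents[host_idx].to("cuda", non_blocking=True)
    out = torch.empty(topk_idx.numel(), cpu_latents.shape[1],
                      device="cuda", dtype=gpu_recent.dtype)
    out[on_gpu] = gpu_rows
    out[~on_gpu] = cpu_rows
    return out
```

Because k is small under sparse attention, per-step PCIe traffic is only k x D elements rather than the full context, which is what makes keeping most of the cache in CPU memory plausible.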
Authors
Xinhang Chen (Baige AI Team, Baidu Inc.)
Chao Zhang (Baige AI Team, Baidu Inc.)
Jiahuan He (Baige AI Team, Baidu Inc.)
Wei Liu (Baige AI Team, Baidu Inc.)
Jianming Zhang (Baige AI Team, Baidu Inc.)
Wenlong Zhou (Baige AI Team, Baidu Inc.)
Xiao Li (Baige AI Team, Baidu Inc.)
Pai Zeng (Baige AI Team, Baidu Inc.)
Shiyong Li (Beijing Institute of Technology)
Yuanpan Qian (Baige AI Team, Baidu Inc.)
Dong Li (Baige AI Team, Baidu Inc.)
Zhaogeng Li (Baige AI Team, Baidu Inc.)