🤖 AI Summary
To address the GPU memory bottleneck caused by the linear growth of the Latent-Cache in DeepSeek-V3.2-Exp during long-context decoding—which severely limits batch size and throughput—this paper proposes an offload-centric heterogeneous cache management architecture. We introduce the first CPU-GPU collaborative dynamic offloading mechanism tailored for sparse attention models, decoupling batch size from GPU memory constraints. Combined with a low-overhead cache residency policy and PD-disaggregation-aware optimization, our approach enables selective migration and efficient reuse of the Latent-Cache. Experimental results demonstrate a 69.4% improvement in decoding throughput at 32K context length and a 123% improvement at 128K, significantly reducing deployment costs for serving long-context LLM workloads.
📝 Abstract
DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved greatly, the Decode stage of PD disaggregation remains a major bottleneck. This bottleneck primarily stems from the conflict between the linear growth of the Latent-Cache with sequence length and limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput.
To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while preserving latency-critical components on GPU. By freeing up GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby reducing deployment costs in real-world settings.
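The residency idea described above—keep hot Latent-Cache blocks on GPU, offload the rest to CPU memory, and migrate blocks back on demand—can be illustrated with a toy LRU-style policy. This is a minimal sketch under our own assumptions, not the paper's implementation: all names (`LatentCacheManager`, `gpu_capacity_blocks`, etc.) are hypothetical, and a real system would move tensors between GPU HBM and pinned host memory asynchronously rather than shuffle Python objects.

```python
# Illustrative sketch only: a toy cache-residency policy in the spirit of
# selective Latent-Cache offloading. Not the ESS implementation.
from collections import OrderedDict

class LatentCacheManager:
    def __init__(self, gpu_capacity_blocks):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu_cache = OrderedDict()   # block_id -> latent block ("on GPU")
        self.cpu_cache = {}              # block_id -> latent block (offloaded)

    def put(self, block_id, block):
        """Insert a latent block; evict the least-recently-used GPU-resident
        block to CPU memory whenever the GPU budget is exceeded."""
        self.gpu_cache[block_id] = block
        self.gpu_cache.move_to_end(block_id)
        while len(self.gpu_cache) > self.gpu_capacity:
            victim_id, victim = self.gpu_cache.popitem(last=False)
            self.cpu_cache[victim_id] = victim   # "offload" to host memory

    def get(self, block_id):
        """Fetch a block for attention; on a CPU hit, migrate it back to GPU."""
        if block_id in self.gpu_cache:
            self.gpu_cache.move_to_end(block_id)
            return self.gpu_cache[block_id]
        block = self.cpu_cache.pop(block_id)     # bring back on demand
        self.put(block_id, block)
        return block

mgr = LatentCacheManager(gpu_capacity_blocks=2)
mgr.put("b0", [0.1]); mgr.put("b1", [0.2]); mgr.put("b2", [0.3])
assert "b0" in mgr.cpu_cache    # LRU block b0 was offloaded to CPU
assert mgr.get("b0") == [0.1]   # and is migrated back to GPU on access
```

Decoupling batch size from GPU memory then amounts to sizing `cpu_cache` against host RAM while `gpu_capacity_blocks` stays fixed; the cost of a CPU hit is a PCIe transfer, which the residency policy tries to keep off the latency-critical path.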
Our high-fidelity simulations show that ESS delivers a 69.4% throughput improvement at 32K context length and up to a 123% throughput improvement at 128K, demonstrating its effectiveness for long-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.