🤖 AI Summary
To address the high memory overhead and access latency caused by linear growth of KV caches with sequence length when deploying large language models (LLMs) on edge devices, this paper proposes a hardware–software co-designed eDRAM-aware KV cache optimization framework. Methodologically, it integrates fine-grained cache eviction, locality-aware memory management, dynamic refresh control, and a lightweight recomputation mechanism to achieve efficient cache compression and reuse while preserving data integrity. Its key contribution lies in being the first to deeply incorporate eDRAM’s physical characteristics—including refresh overhead and bandwidth–energy trade-offs—into KV cache management policies. Experimental results demonstrate that, compared to baseline approaches, the framework achieves a 3.9× inference speedup and a 4.5× improvement in energy efficiency, significantly enhancing low-latency, low-power LLM inference capabilities at the edge.
📝 Abstract
Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does not need to be sent to the cloud, ensuring faster responses and reducing reliance on network connectivity. However, implementing LLMs on edge devices presents challenges, particularly with managing key-value (KV) caches, which play a pivotal role in LLM serving. As the input text lengthens, the size of the KV cache grows linearly with the sequence length, leading to a significant memory footprint and high data access costs. At the same time, edge devices have limited memory and computational power, making it hard to store and efficiently access the large caches needed for LLM inference.
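To make the linear growth concrete, here is a minimal back-of-the-envelope estimate of KV cache size. The formula (two tensors, K and V, per layer, each of shape `[seq_len, n_kv_heads, head_dim]`) is standard for transformer inference; the model configuration plugged in below (LLaMA-2-7B-like: 32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption, not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    each holding seq_len * n_kv_heads * head_dim values."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Illustrative LLaMA-2-7B-like config in fp16 (assumed, not from the paper):
gib = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB at 4096 tokens; doubling seq_len doubles it
```

At a 4K-token context this already consumes about 2 GiB, which dwarfs the on-chip SRAM budget of typical edge accelerators and motivates denser storage such as eDRAM.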
To mitigate the substantial overhead caused by the KV cache, we propose using embedded DRAM (eDRAM) as the primary storage for LLM serving on edge devices, as it offers higher storage density than SRAM. However, to ensure data integrity, eDRAM requires periodic refresh operations, which are power-intensive. To reduce eDRAM costs and improve overall system performance, we propose *Kelle*, a software-hardware co-design solution optimized for deploying LLMs on eDRAM-based edge systems. Combined with our fine-grained memory eviction, recomputation, and refresh control algorithms, the *Kelle* accelerator delivers a 3.9× speedup and 4.5× energy savings compared to existing baseline solutions.
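The abstract does not spell out the eviction policy, so as orientation only, here is a generic sketch of fine-grained KV cache eviction in the style commonly used for cache compression: always retain a window of the most recent tokens, then fill the remaining budget with the tokens that have accumulated the highest attention mass. The function name, parameters, and scoring rule are all illustrative assumptions, not Kelle's actual algorithm.

```python
import heapq

def evict_tokens(attn_scores, keep_recent, budget):
    """Generic KV-cache eviction sketch (NOT Kelle's actual policy).
    Keeps the `keep_recent` most recent tokens unconditionally, then
    fills the rest of `budget` with the highest-scoring older tokens.
    Returns the sorted indices of tokens whose K/V entries are kept."""
    n = len(attn_scores)
    if n <= budget:
        return list(range(n))  # everything fits; nothing to evict
    recent = set(range(n - keep_recent, n))
    older = [i for i in range(n) if i not in recent]
    top = heapq.nlargest(budget - keep_recent, older,
                         key=lambda i: attn_scores[i])
    return sorted(recent | set(top))

# Keep 3 of 5 tokens: the 2 most recent plus the best-scoring older one.
print(evict_tokens([0.1, 0.9, 0.2, 0.05, 0.3], keep_recent=2, budget=3))
```

Evicted entries can later be recovered by recomputation when needed, trading compute for the eDRAM capacity and refresh energy saved, which is the trade-off the co-design exploits.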