EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work highlights the overlooked embodied carbon footprint of KV caching in LLM inference services: while KV caching reduces operational carbon emissions, its reliance on high-capacity, high-speed SSDs substantially increases manufacturing-phase embodied carbon, a cost that grows with model scale. The paper proposes an adaptive cache management framework that minimizes the end-to-end carbon footprint across the hardware lifecycle by jointly modeling operational carbon (from compute and memory access) and embodied carbon (from SSD fabrication and deployment). It combines a carbon-intensity-aware joint analysis of cache hotness and energy with a Service-Level Objective (SLO)-constrained, carbon-aware Integer Linear Programming (ILP) optimization model. Evaluated on Llama-3 70B inference, the framework reduces carbon emissions by 9.5% on average, and by up to 31.2% under low-carbon-grid conditions, while meeting latency SLOs.

📝 Abstract
As large language models (LLMs) become widely used, their environmental impact, especially carbon emissions, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present EmbAdvisor, a carbon-aware caching framework that selects the optimal cache size for LLM serving. EmbAdvisor profiles different LLM tasks and uses an Integer Linear Programming (ILP) solver to select cache sizes that meet SLOs while minimizing total carbon emissions. Overall, EmbAdvisor reduces the average carbon emissions of a Llama-3 70B model by 9.5% under various carbon intensities compared to a non-adaptive cache scenario, and can save up to 31.2% when the carbon intensity is low.
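The selection the abstract describes can be sketched as a toy search over profiled cache sizes. This is not the paper's implementation (the full system uses an ILP solver over task profiles); all profile names, numbers, and the helper `pick_cache_size` below are illustrative assumptions.

```python
def pick_cache_size(profiles, slo_ms, carbon_intensity):
    """Choose the cache size minimizing total carbon under a latency SLO.

    profiles: list of dicts with hypothetical per-configuration measurements:
      size_gb     -- KV cache capacity on SSD
      latency_ms  -- serving latency at that capacity
      energy_kwh  -- operational energy over the accounting window
      embodied_g  -- amortized SSD embodied carbon (gCO2) for that capacity
    carbon_intensity: grid carbon intensity in gCO2/kWh.
    Returns (size_gb, total_gCO2) or None if no configuration meets the SLO.
    """
    best = None
    for p in profiles:
        if p["latency_ms"] > slo_ms:  # SLO constraint
            continue
        # Total carbon = operational (energy x grid intensity) + embodied.
        total = p["energy_kwh"] * carbon_intensity + p["embodied_g"]
        if best is None or total < best[1]:
            best = (p["size_gb"], total)
    return best


# Illustrative profiles: a larger cache cuts latency and energy but
# carries more amortized embodied carbon from SSD manufacturing.
profiles = [
    {"size_gb": 0,   "latency_ms": 120, "energy_kwh": 0.50, "embodied_g": 0},
    {"size_gb": 256, "latency_ms": 90,  "energy_kwh": 0.35, "embodied_g": 40},
    {"size_gb": 512, "latency_ms": 80,  "energy_kwh": 0.25, "embodied_g": 80},
]
# On a clean (low-intensity) grid, operational savings are worth less,
# so the smaller cache wins; on a dirty grid, the larger cache pays off.
print(pick_cache_size(profiles, slo_ms=100, carbon_intensity=50))
print(pick_cache_size(profiles, slo_ms=100, carbon_intensity=500))
```

The same structure, with one binary decision variable per candidate cache size and the SLO as a constraint, is what an ILP formulation would encode; the enumeration above is enough here because the candidate set is small and discrete.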
Problem

Research questions and friction points this paper is trying to address.

Balancing operational and embodied carbon in LLM caching
Optimizing cache size for sustainable LLM serving
Reducing carbon emissions via adaptive KV cache management
Innovation

Methods, ideas, or system contributions that make the work stand out.

Carbon-aware caching framework for LLMs
ILP solver optimizes cache size selection
Reduces carbon emissions by 9.5% on average, up to 31.2% on low-carbon grids
Yuyang Tian
University of Waterloo, Canada
Desen Sun
University of Waterloo, Canada
Yi Ding
Purdue University, USA
Sihang Liu
Assistant Professor, School of Computer Science, University of Waterloo
Computer Systems · Computer Architecture · Sustainability