EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work highlights the overlooked embodied carbon footprint of KV caching in LLM inference services: while KV caching reduces operational carbon emissions, its reliance on high-capacity, high-speed SSDs substantially increases manufacturing-phase embodied carbon, a cost that grows with model scale. The paper proposes an adaptive cache management framework that minimizes the end-to-end carbon footprint across the hardware lifecycle by jointly modeling operational carbon (from compute and memory access) and embodied carbon (from SSD fabrication and deployment). It combines a carbon-intensity-aware joint analysis of cache hotness and energy with a Service-Level Objective (SLO)-constrained, carbon-aware Integer Linear Programming (ILP) optimization model. Evaluated on Llama-3 70B inference, the framework reduces carbon emissions by 9.5% on average, and by up to 31.2% under low-carbon-grid conditions, while meeting latency SLOs.

📝 Abstract
As large language models (LLMs) become widely used, their environmental impact, especially carbon emissions, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present EmbAdvisor, a carbon-aware caching framework that selects the optimal cache size for LLM serving. EmbAdvisor profiles different LLM tasks and uses an Integer Linear Programming (ILP) solver to select cache sizes that meet SLOs while minimizing total carbon emissions. Overall, EmbAdvisor reduces the average carbon emissions of a Llama-3 70B model by 9.5% under various carbon intensities compared to a non-adaptive cache scenario, and can save up to 31.2% when the carbon intensity is low.
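The selection the abstract describes can be sketched as a toy search over profiled cache sizes. This is not the paper's implementation (the full system uses an ILP solver over task profiles); all profile names, numbers, and the helper `pick_cache_size` below are illustrative assumptions.

```python
def pick_cache_size(profiles, slo_ms, carbon_intensity):
    """Choose the cache size minimizing total carbon under a latency SLO.

    profiles: list of dicts with hypothetical per-configuration measurements:
      size_gb     -- KV cache capacity on SSD
      latency_ms  -- serving latency at that capacity
      energy_kwh  -- operational energy over the accounting window
      embodied_g  -- amortized SSD embodied carbon (gCO2) for that capacity
    carbon_intensity: grid carbon intensity in gCO2/kWh.
    Returns (size_gb, total_gCO2) or None if no configuration meets the SLO.
    """
    best = None
    for p in profiles:
        if p["latency_ms"] > slo_ms:  # SLO constraint
            continue
        # Total carbon = operational (energy x grid intensity) + embodied.
        total = p["energy_kwh"] * carbon_intensity + p["embodied_g"]
        if best is None or total < best[1]:
            best = (p["size_gb"], total)
    return best


# Illustrative profiles: a larger cache cuts latency and energy but
# carries more amortized embodied carbon from SSD manufacturing.
profiles = [
    {"size_gb": 0,   "latency_ms": 120, "energy_kwh": 0.50, "embodied_g": 0},
    {"size_gb": 256, "latency_ms": 90,  "energy_kwh": 0.35, "embodied_g": 40},
    {"size_gb": 512, "latency_ms": 80,  "energy_kwh": 0.25, "embodied_g": 80},
]
# On a clean (low-intensity) grid, operational savings are worth less,
# so the smaller cache wins; on a dirty grid, the larger cache pays off.
print(pick_cache_size(profiles, slo_ms=100, carbon_intensity=50))
print(pick_cache_size(profiles, slo_ms=100, carbon_intensity=500))
```

The same structure, with one binary decision variable per candidate cache size and the SLO as a constraint, is what an ILP formulation would encode; the enumeration above is enough here because the candidate set is small and discrete.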
Problem

Research questions and friction points this paper is trying to address.

Balancing operational and embodied carbon in LLM caching
Optimizing cache size for sustainable LLM serving
Reducing carbon emissions via adaptive KV cache management
Innovation

Methods, ideas, or system contributions that make the work stand out.

Carbon-aware caching framework for LLMs
ILP solver optimizes cache size selection
Reduces carbon emissions by 9.5% on average, up to 31.2% on low-carbon grids
Yuyang Tian
University of Waterloo, Canada
Desen Sun
University of Waterloo, Canada
Yi Ding
Purdue University, USA
Sihang Liu
Assistant Professor, School of Computer Science, University of Waterloo
Computer Systems · Computer Architecture · Sustainability