🤖 AI Summary
To mitigate timing side-channel attacks arising from global KV-cache sharing in large language model (LLM) inference, this work proposes a selective cache-sharing mechanism that improves inference efficiency while preserving input privacy. Methodologically, it integrates multi-level privacy detection (rule-based matching, a general-purpose detector, and context-aware validation) with a unified radix-tree index and entropy-driven dynamic access monitoring, enabling fine-grained separation and real-time protection of sensitive versus non-sensitive cache entries. Experiments show that the approach mitigates 94%–97% of timing side-channel attacks; compared to full cache isolation, it reduces first-token latency by 40.58% and improves throughput by 2.66×. The core contribution is the first co-design of privacy-aware cache management and efficient KV-cache sharing, achieving a principled balance between security guarantees and system performance.
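The summary's "entropy-driven dynamic access monitoring" can be illustrated with a minimal sketch. The function names, threshold, and flagging rule below are assumptions for illustration, not the paper's actual design: the idea is that a shared cache entry probed overwhelmingly by a single user yields a low-entropy access distribution, which can be treated as a possible timing probe and demoted to private.

```python
import math
from collections import Counter

def access_entropy(accesses):
    """Shannon entropy (bits) of the per-user access distribution
    for one cache entry. accesses is a list of user ids."""
    counts = Counter(accesses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def flag_suspicious(accesses, threshold=0.5):
    """Hypothetical rule: flag an entry whose accesses are dominated
    by one user (low entropy), a pattern consistent with repeated
    timing probes rather than organic cross-user reuse."""
    return access_entropy(accesses) < threshold

# A single user hammering one entry: entropy 0.0, flagged.
print(flag_suspicious(["u1"] * 20))                # True
# Diverse reuse across four users: entropy 2.0, not flagged.
print(flag_suspicious(["u1", "u2", "u3", "u4"]))   # False
```

A real monitor would also need to exempt an entry's legitimate owner and use windowed counts; this sketch only shows the entropy signal itself.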
📝 Abstract
Global KV-cache sharing has emerged as a key optimization for accelerating large language model (LLM) inference. However, it exposes a new class of timing side-channel attacks, enabling adversaries to infer sensitive user inputs via shared cache entries. Existing defenses, such as per-user isolation, eliminate leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT), making them impractical for high-throughput deployment. To address this gap, we introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware KV-cache management framework that selectively shares non-sensitive entries while confining sensitive content to private caches. SafeKV comprises three components: (i) a hybrid, multi-tier detection pipeline that integrates rule-based pattern matching, a general-purpose privacy detector, and context-aware validation; (ii) a unified radix-tree index that manages public and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and (iii) entropy-based access monitoring to detect and mitigate residual information leakage. Our evaluation shows that SafeKV mitigates 94%–97% of timing-based side-channel attacks. Compared to the per-user isolation baseline, SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66× across diverse LLMs and workloads. SafeKV reduces cache-induced TTFT overhead from 50.41% to 11.74% on Qwen3-235B. By combining fine-grained privacy control with high cache reuse efficiency, SafeKV reclaims the performance advantages of global sharing while providing robust runtime privacy guarantees for LLM inference.
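The unified index that "manages public and private entries" can be sketched as a token trie (a simplified stand-in for the paper's radix tree) whose nodes carry a privacy flag. All class and method names here are hypothetical, not SafeKV's API: public prefixes are reusable by any request, while privacy-flagged suffixes match only for their owner, so a cross-user probe never observes a cache hit on another user's sensitive tokens.

```python
class Node:
    def __init__(self):
        self.children = {}         # token -> Node
        self.private_owner = None  # None = public entry; else owning user id

class SafeKVIndex:
    """Illustrative unified index over public and private KV-cache entries."""

    def __init__(self):
        self.root = Node()

    def insert(self, tokens, owner=None):
        """Insert a token sequence; owner=None marks new nodes public.
        Existing (already-public) prefix nodes keep their visibility."""
        node = self.root
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = Node()
                child.private_owner = owner  # None => shareable by everyone
                node.children[t] = child
            node = child

    def match_prefix(self, tokens, user):
        """Length of the longest cached prefix this user may reuse."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            # Private entries are reusable only by their owner, which is
            # what blocks cross-user timing probes on sensitive suffixes.
            if child.private_owner is not None and child.private_owner != user:
                break
            node, matched = child, matched + 1
        return matched

idx = SafeKVIndex()
idx.insert([1, 2, 3])                        # public shared prompt prefix
idx.insert([1, 2, 3, 9, 9], owner="alice")   # alice's sensitive suffix

print(idx.match_prefix([1, 2, 3, 9, 9], "alice"))  # 5: full reuse
print(idx.match_prefix([1, 2, 3, 9, 9], "bob"))    # 3: public prefix only
```

The key property is that "bob" sees exactly the cache behavior he would see if alice's suffix had never been cached, which is the selective-sharing analogue of full isolation at a fraction of its cost.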