The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work identifies and systematically exploits novel timing side channels in large language model (LLM) services. The channels arise from shared caches (specifically KV caches and semantic caches) and enable inference of confidential system prompts and other users' prompts in multi-tenant settings. The authors propose a token-level prefix search algorithm that combines cache-hit classification with fine-grained timing measurements to recover private prompts with high accuracy under black-box conditions. They empirically validate the attack against several mainstream online LLM services, successfully reconstructing both system and user prompts and demonstrating that the threat is practical. To their knowledge, this is the first work to formally model cache-induced timing leakage as an exploitable side channel in LLM inference systems. The findings underscore the need for prompt-privacy-aware defenses against cache-based side-channel threats in LLM serving infrastructure.

📝 Abstract
The wide deployment of Large Language Models (LLMs) has given rise to strong demands for optimizing their inference performance. Today's techniques serving this purpose primarily focus on reducing latency and improving throughput through algorithmic and hardware enhancements, while largely overlooking their privacy side effects, particularly in a multi-user environment. In our research, for the first time, we discovered a set of new timing side channels in LLM systems, arising from shared caches and GPU memory allocations, which can be exploited to infer both confidential system prompts and those issued by other users. These vulnerabilities echo security challenges observed in traditional computing systems, highlighting an urgent need to address potential information leakage in LLM serving infrastructures. In this paper, we report novel attack strategies designed to exploit such timing side channels inherent in LLM deployments, specifically targeting the Key-Value (KV) cache and semantic cache widely used to enhance LLM inference performance. Our approach leverages timing measurements and classification models to detect cache hits, allowing an adversary to infer private prompts with high accuracy. We also propose a token-by-token search algorithm to efficiently recover shared prompt prefixes in the caches, showing the feasibility of stealing system prompts and those produced by peer users. Our experimental studies on black-box testing of popular online LLM services demonstrate that such privacy risks are completely realistic, with significant consequences. Our findings underscore the need for robust mitigation to protect LLM systems against such emerging threats.
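The attack's core primitive, per the abstract, is detecting cache hits from response timing: a cached prompt prefix skips recomputation, so its time-to-first-token (TTFT) is measurably shorter. A minimal sketch of such a detector follows, assuming a hypothetical `send_prompt` callable that blocks until the first streamed token arrives; the fixed threshold is illustrative, whereas the paper trains classification models on the timing measurements:

```python
import time
import statistics

def measure_ttft(send_prompt, prompt, trials=5):
    """Median time-to-first-token (TTFT) over several trials.

    `send_prompt` is a hypothetical client callable that issues the
    prompt to the LLM service and returns once the first response
    token arrives. Repeated trials and the median reduce network and
    scheduling noise.
    """
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        send_prompt(prompt)  # blocks until the first token streams back
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def is_cache_hit(send_prompt, prompt, threshold):
    """Classify a prompt as a KV-/semantic-cache hit when its TTFT
    falls below a calibrated threshold.

    A real attack would calibrate `threshold` from known-hit and
    known-miss probe prompts, or replace this cutoff with a trained
    classifier as the paper does.
    """
    return measure_ttft(send_prompt, prompt) < threshold
```

In practice the threshold would be calibrated per service, since TTFT baselines differ across deployments and load conditions.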
Problem

Research questions and friction points this paper is trying to address.

Detect timing side channels in LLM serving systems
Exploit vulnerabilities in shared caches and GPU memory allocations
Protect LLM systems against privacy-leakage threats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timing side-channel detection
Cache-hit classification models
Token-by-token prefix search algorithm
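The token-by-token search can be sketched as greedy prefix extension driven by a cache-hit oracle. In this sketch, `oracle` is a hypothetical stand-in for a timing-based cache-hit test (it returns True iff the candidate token sequence is a cached prefix), and `vocab` is enumerated exhaustively for clarity; a practical attack would prioritize likely tokens rather than scan the full vocabulary:

```python
def recover_prefix(vocab, oracle, max_len=50):
    """Greedily recover a cached prompt prefix token by token.

    At each step, try extending the current candidate prefix with each
    vocabulary token and keep the first extension the cache-hit oracle
    confirms. Stops when no token extends the prefix or `max_len` is
    reached. `oracle` and `vocab` are illustrative stand-ins, not the
    paper's exact interface.
    """
    prefix = []
    for _ in range(max_len):
        for tok in vocab:
            if oracle(prefix + [tok]):
                prefix.append(tok)
                break  # token confirmed; extend and continue
        else:
            break  # no token yields a cache hit: prefix is complete
    return prefix
```

The cost is linear in prefix length times vocabulary size per step, which is why combining the search with accurate hit classification (to avoid false extensions) matters for recovering long system prompts.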
Linke Song
Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences
Zixuan Pang
University of Science and Technology of China
Wenhao Wang
Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences
Zihao Wang
Indiana University Bloomington
Xiaofeng Wang
Indiana University Bloomington
Hongbo Chen
Indiana University Bloomington
Wei Song
Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences
Yier Jin
Associate Professor, University of Florida
hardware security, cyber security, security in Internet of Things, formal verification
Dan Meng
OPPO
Rui Hou
Member of Technical Staff, xAI
Large Language Model, Reasoning