🤖 AI Summary
Existing LLM inference systems are prone to security and reliability failures under realistic concurrent workloads because of shared state such as KV caches, batching, and multi-tenant scheduling; conventional model, safety, and API tests miss these failures. This work presents GRIEF, a greybox fuzzer that treats timed multi-request traces as first-class inputs. GRIEF uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and combines controlled replay with log-probability checks to confirm reproducible failures. The approach uncovers failure classes including silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes, identifying 15 vulnerabilities in vLLM and SGLang (10 confirmed by engine developers, including 2 CVEs). These findings establish concurrent serving behavior as a first-class security and reliability boundary for LLM infrastructure.
📝 Abstract
LLM inference and serving systems have become security-critical infrastructure; however, many of their most concerning failures arise from the serving layer rather than from model behavior alone. Modern inference engines combine KV cache, batching, prefix sharing, speculative decoding, adapters, and multi-tenant scheduling, creating shared-state behavior that only emerges under realistic concurrent workloads and is missed by standard model, safety, and API tests. We present GRIEF, a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and applies controlled replay with log-probability checks to confirm reproducible serving-layer failures. Across early campaigns on vLLM and SGLang, GRIEF discovers 15 vulnerabilities, 10 confirmed by engine developers, including 2 CVEs, spanning KV-cache isolation failures, cross-request performance interference, and crash or liveness bugs. These results show that concurrency, caching, and state reuse can induce silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes without malformed inputs or explicit server errors, making concurrent serving behavior a first-class security and reliability boundary for LLM infrastructure.
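To make the idea of a timed multi-request trace and a log-probability check more concrete, the following is a minimal Python sketch. It is not GRIEF's implementation: the `TimedRequest` schema, the `schedule` and `first_divergence` helpers, and the `1e-3` tolerance are all assumptions introduced here for illustration. The sketch orders requests by send offset and compares per-token log-probabilities from an isolated run against a concurrent replay of the same request.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class TimedRequest:
    """One request in a trace (hypothetical schema, not GRIEF's format)."""
    offset_ms: int       # send time relative to the start of the trace
    prompt: str
    max_tokens: int = 16


def schedule(trace: Sequence[TimedRequest]) -> List[TimedRequest]:
    """Order requests by send time; requests with equal offsets keep their order."""
    return sorted(trace, key=lambda r: r.offset_ms)


def first_divergence(
    isolated: Sequence[float],
    concurrent: Sequence[float],
    tol: float = 1e-3,  # assumed tolerance; a real check must allow for numeric noise
) -> Optional[int]:
    """Return the first token index where the two runs disagree, else None.

    `isolated` holds per-token log-probabilities from running the request alone;
    `concurrent` holds them from replaying the same request inside the trace.
    """
    for i, (a, b) in enumerate(zip(isolated, concurrent)):
        if abs(a - b) > tol:
            return i
    if len(isolated) != len(concurrent):
        # One run produced more tokens than the other.
        return min(len(isolated), len(concurrent))
    return None


trace = [
    TimedRequest(offset_ms=5, prompt="shared prefix, request B"),
    TimedRequest(offset_ms=0, prompt="shared prefix, request A"),
]
ordered = schedule(trace)

clean = first_divergence([-0.10, -0.52, -1.30], [-0.10, -0.52, -1.30])
suspect = first_divergence([-0.10, -0.52, -1.30], [-0.10, -0.91, -1.30])
```

In this sketch, a non-`None` result only flags a candidate for silent output corruption. Batching can legitimately perturb floating-point results, so a practical oracle would likely need a calibrated tolerance and repeated replay before reporting a failure.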