🤖 AI Summary
This work addresses the significant throughput degradation in batched inference of large language model (LLM) agents caused by mid-run memory pressure from accumulated key-value (KV) caches. To mitigate this issue, the study introduces a lightweight, cache-aware concurrency control mechanism inspired by congestion control principles from distributed systems. The proposed approach dynamically regulates the number of active agents at runtime using real-time KV cache signals, enabling proactive admission control that manages cache pressure before it degrades performance. Designed to be compatible with existing LLM serving frameworks, the method achieves throughput improvements of up to 4.09× on Qwen3-32B and 1.9× on DeepSeek-V3, effectively alleviating middle-phase thrashing without requiring architectural modifications.
📝 Abstract
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middle-phase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present CONCUR, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. CONCUR adapts a cache-aware control algorithm to dynamically adjust the number of active agents using runtime cache signals. Across large models and real-world agent workloads, CONCUR prevents middle-phase thrashing and improves batch inference throughput by up to 4.09× on Qwen3-32B and 1.9× on DeepSeek-V3, while remaining compatible with existing LLM serving systems.
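To make the congestion-control analogy concrete, here is a minimal sketch of a feedback-driven admission controller that adjusts an active-agent limit from a runtime KV-cache utilization signal. It uses a classic AIMD (additive-increase, multiplicative-decrease) rule as a stand-in for CONCUR's control algorithm; all class names, watermarks, and step sizes are illustrative assumptions, not the paper's actual implementation or API.

```python
class CacheAwareAdmissionController:
    """Hypothetical AIMD-style controller: bounds the number of active agents
    based on a KV-cache utilization signal in [0, 1]. All thresholds are
    assumed values for illustration, not taken from CONCUR."""

    def __init__(self, min_agents=1, max_agents=64,
                 high_watermark=0.85, low_watermark=0.60,
                 increase_step=1, decrease_factor=0.5):
        self.min_agents = min_agents
        self.max_agents = max_agents
        self.high_watermark = high_watermark   # back off above this utilization
        self.low_watermark = low_watermark     # probe for more concurrency below it
        self.increase_step = increase_step
        self.decrease_factor = decrease_factor
        self.active_limit = min_agents         # current admission cap

    def update(self, kv_cache_utilization: float) -> int:
        """Adjust the active-agent limit from the latest cache signal."""
        if kv_cache_utilization >= self.high_watermark:
            # Multiplicative decrease: shed agents before thrashing sets in.
            self.active_limit = max(self.min_agents,
                                    int(self.active_limit * self.decrease_factor))
        elif kv_cache_utilization <= self.low_watermark:
            # Additive increase: cautiously admit more agents.
            self.active_limit = min(self.max_agents,
                                    self.active_limit + self.increase_step)
        # Between the watermarks: hold the current limit steady.
        return self.active_limit

    def can_admit(self, currently_active: int) -> bool:
        """Gate new agent admissions against the current cap."""
        return currently_active < self.active_limit
```

A serving loop would call `update()` each scheduling tick with the current cache utilization and consult `can_admit()` before starting a new agent; agents already running continue uninterrupted, which mirrors the abstract's goal of bounding aggregate cache pressure while preserving execution continuity.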