A Policy-Driven Runtime Layer for Agentic LLM Serving

📅 2026-05-26
🤖 AI Summary
This work addresses the lack of unified support for cross-layer policies, such as caching, batching, and security, in current large language model (LLM) serving stacks, which in multi-agent systems are typically handled by ad hoc patches. The authors propose an agent runtime layer, inserted between the agent framework and the inference engine, that coordinates these policies through four primitives (observe, score, predict, and act) keyed on agent identity, so that policies plug in rather than being patched into either neighbor. The design is validated on cross-session KV caching with CacheSage, which learns a per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetching. In preliminary results on five real multi-agent workloads, the approach improves cache hit rates by 13 to 37 percentage points, reduces mean time-to-first-token (TTFT) by 12% to 29%, and increases throughput by 6% to 14% over an unmodified serving stack.
📝 Abstract
Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
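To make the four primitives concrete, the following is a minimal sketch of how an agent transition matrix might be learned online and used to pick an eviction victim and a prefetch candidate. It is an illustration under stated assumptions, not the authors' implementation: the class names `TransitionModel` and `CachePolicy`, the method signatures, and the scoring rule (the highest estimated probability that any currently active agent hands off to the cached agent) are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict


class TransitionModel:
    """Hypothetical online estimate of P(next agent | current agent)."""

    def __init__(self):
        # counts[prev][nxt] = number of observed prev -> nxt hand-offs
        self.counts = defaultdict(lambda: defaultdict(int))
        # session id -> agent that ran most recently in that session
        self.last_agent = {}

    def observe(self, session, agent):
        """Record that `agent` just ran in `session` and update the counts."""
        prev = self.last_agent.get(session)
        if prev is not None:
            self.counts[prev][agent] += 1
        self.last_agent[session] = agent

    def predict(self, agent):
        """Return the empirical next-agent distribution for `agent`."""
        row = self.counts.get(agent, {})
        total = sum(row.values())
        if total == 0:
            return {}
        return {nxt: c / total for nxt, c in row.items()}


class CachePolicy:
    """Hypothetical cache policy built on the transition estimates."""

    def __init__(self, model):
        self.model = model

    def score(self, cached_agent, active_agents):
        """Assumed survival score: the best chance that some active agent
        hands off to `cached_agent` next (0.0 if nothing is active)."""
        return max(
            (self.model.predict(a).get(cached_agent, 0.0) for a in active_agents),
            default=0.0,
        )

    def act(self, cached_agents, active_agents):
        """Pick the cached agent least likely to be needed as the eviction
        victim, and the most likely uncached successor as the prefetch."""
        evict = min(cached_agents, key=lambda a: self.score(a, active_agents))
        candidates = {
            nxt for a in active_agents for nxt in self.model.predict(a)
        } - set(cached_agents)
        prefetch = max(
            candidates, key=lambda a: self.score(a, active_agents), default=None
        )
        return evict, prefetch


# Example: feed one session's dispatch order, then ask for a decision.
model = TransitionModel()
for name in ["planner", "coder", "planner", "reviewer", "planner", "coder"]:
    model.observe("session-1", name)

policy = CachePolicy(model)
decision = policy.act(cached_agents=["reviewer", "planner"], active_agents=["planner"])
```

In this sketch the agent name is the only key shared between the observation stream and the cache decision, which mirrors the abstract's point that agent identity is the shared coordinate. A real engine integration would act on KV blocks rather than agent names.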
Problem

Research questions and friction points this paper is trying to address.

multi-agent LLM systems
serving stack
cross-cutting policies
agent-awareness
KV caching
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent runtime layer
policy-driven serving
KV caching
multi-agent LLM systems
CacheSage