🤖 AI Summary
Traditional large language model (LLM) serving architectures struggle to efficiently support agentic AI workloads, which are stateful, multi-turn, and interleaved with tool invocations. This work builds an end-to-end tracing system to characterize the LLM invocation and tool execution behavior of ReAct-style agents under both reasoning and non-reasoning model configurations, the first such analysis to date. The study finds that agentic workloads are not merely long-prompt workloads: with effective context caching, most input tokens are reused across turns, so execution becomes decode-dominated and depends heavily on long-lived key-value (KV) cache state. Tool usage also follows a phased pattern, shifting from read/explore behavior early in execution to execute/write behavior later. These findings point to concrete optimization directions for serving systems tailored to agentic AI applications.
📝 Abstract
Agentic AI shifts LLM serving from isolated prompt-generation requests to stateful, multi-turn executions that repeatedly invoke the model, call tools, and grow context over time. This paper characterizes ReAct-style agents from both the LLM-serving and tool-execution perspectives using an end-to-end tracing infrastructure across reasoning and non-reasoning Gemma and Qwen configurations on five agentic benchmarks. Our study shows that agentic workloads are not simply long-prompt workloads: with effective context caching, most input tokens are reused across turns, making execution decode-dominated while increasing dependence on long-lived KV-cache state. We also find that tool use has a clear temporal structure, with agents shifting from read/explore behavior early in execution to execute/write behavior later. These results show that efficient agentic serving must jointly manage repeated model re-entry, persistent context state, and workload-dependent tool behavior.
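To make the cross-turn reuse claim concrete, the following is a minimal sketch of the token accounting, not the paper's tracing infrastructure. It assumes an idealized prefix cache that retains the KV state of the whole context between turns; the names `Turn` and `summarize` and every token count are hypothetical, chosen only for illustration. It counts tokens only and does not model latency.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    new_input_tokens: int  # tokens appended this turn (task text or tool output); hypothetical
    output_tokens: int     # tokens decoded by the model this turn; hypothetical


def summarize(system_prompt_tokens: int, turns: list[Turn]) -> dict:
    """Token accounting for a multi-turn agent run under an idealized prefix cache.

    Assumption: after each turn, the KV state of the full context (prompt plus
    generated tokens) stays cached and is fully reused by the next turn.
    """
    context = system_prompt_tokens  # tokens carried into the next prompt
    cached = 0                      # tokens whose KV state is already cached
    total_input = reused = prefilled = decoded = 0

    for turn in turns:
        prompt = context + turn.new_input_tokens
        total_input += prompt
        reused += cached               # prefix served from the cache
        prefilled += prompt - cached   # tokens that still need prefill
        decoded += turn.output_tokens
        context = prompt + turn.output_tokens
        cached = context               # idealized: nothing is evicted

    return {
        "total_input": total_input,
        "reused": reused,
        "prefilled": prefilled,
        "decoded": decoded,
        "reuse_fraction": reused / total_input if total_input else 0.0,
    }


# Hypothetical three-turn run: one task message, then two tool results.
stats = summarize(1000, [Turn(200, 300), Turn(400, 300), Turn(400, 300)])
print(stats)
```

In this toy run the reused share of input tokens grows with every turn, because each prompt repeats the entire prior context while only the newly appended tokens need prefill. The same accounting shows why the cached state must persist across tool calls: if the cache is evicted between turns, every prompt token has to be prefilled again.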