AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU

📅 2026-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high time-to-first-token (TTFT) latency and unstable interactivity that arise when serving multiple agents concurrently on consumer-grade GPUs, where long prefill requests compete with short decode requests for shared resources. The paper introduces, for the first time, a three-phase model tailored to agent workloads (cold prefill, resume prefill, and short decode) and proposes a system that isolates the prefill and decode phases, dynamically budgets resume prefills, and adaptively schedules GPU resources across CUDA Green Context slots. Experimental results show up to 2.8× improvement in TTFT and 2.7× improvement in time per output token (TPOT) on consumer GPUs, significantly enhancing latency stability while maintaining high throughput.

📝 Abstract
Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append tool outputs to cached contexts, and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state-of-the-art baselines across different settings.
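The abstract describes three scheduling ideas: isolating latency-critical decodes from prefills, chunking prefill work under per-step token budgets, and giving resume prefills (which extend cached contexts) a separate, smaller budget than cold prefills. The sketch below illustrates that scheduling policy in plain Python. It is not the paper's implementation; all names (`Request`, `PhaseScheduler`, the budget values) and the specific admission order are assumptions made for illustration, and the real system maps this onto CUDA Green Context slots rather than a software queue.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical sketch of three-phase, budget-based scheduling in the
# spirit of AgentServe. Class and parameter names are illustrative,
# not taken from the paper.

@dataclass
class Request:
    rid: str
    phase: str          # "cold_prefill" | "resume_prefill" | "decode"
    tokens: int         # tokens remaining in this phase

class PhaseScheduler:
    """Isolates decodes from prefills and budgets prefill chunks.

    Each step, every decode emits one token unconditionally (decodes are
    latency-critical), while prefill work is admitted in chunks under
    per-phase token budgets so a long cold prefill cannot cause
    head-of-line blocking.
    """
    def __init__(self, cold_budget: int = 4096, resume_budget: int = 1024):
        self.cold_budget = cold_budget
        self.resume_budget = resume_budget
        self.queues = {"cold_prefill": deque(),
                       "resume_prefill": deque(),
                       "decode": deque()}

    def submit(self, req: Request) -> None:
        self.queues[req.phase].append(req)

    def step(self):
        """Return the batch for one step as (rid, phase, tokens) tuples."""
        batch = []
        # 1) Decodes always run: one token each, never blocked by prefills.
        for req in list(self.queues["decode"]):
            batch.append((req.rid, "decode", 1))
            req.tokens -= 1
            if req.tokens <= 0:
                self.queues["decode"].remove(req)
        # 2) Resume prefills: small budget, since they append tool output
        #    to an already-cached context and feed an agent mid-loop.
        self._admit("resume_prefill", self.resume_budget, batch)
        # 3) Cold prefills: larger budget, chunked across steps.
        self._admit("cold_prefill", self.cold_budget, batch)
        return batch

    def _admit(self, phase: str, budget: int, batch: list) -> None:
        q = self.queues[phase]
        while q and budget > 0:
            req = q[0]
            chunk = min(req.tokens, budget)
            batch.append((req.rid, phase, chunk))
            req.tokens -= chunk
            budget -= chunk
            if req.tokens == 0:
                q.popleft()

sched = PhaseScheduler(cold_budget=512, resume_budget=128)
sched.submit(Request("agent-A", "cold_prefill", 2000))
sched.submit(Request("agent-B", "decode", 3))
batch = sched.step()
# agent-B's decode is scheduled ahead of the 2000-token prefill,
# which is admitted only as a 512-token chunk this step.
```

The key property being illustrated is that the decode queue is drained before any prefill budget is spent, so interactive token emission stays stable even while a long system-prompt prefill is in flight.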
Problem

Research questions and friction points this paper is trying to address.

agentic AI
LLM serving
resource contention
head-of-line blocking
consumer-grade GPU
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic AI serving
algorithm-system co-design
CUDA Green Context
dynamic budgeting
prefill-decode isolation