VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional LLM serving systems rely on a single general-purpose architecture, which is hard to optimize across diverse models, workloads, and hardware. This work proposes VibeServe, which the authors describe as the first agentic loop that generates entire LLM serving stacks end-to-end. An outer loop plans and tracks the search over system designs, while an inner loop implements each candidate, checks its correctness, and measures its performance on the target benchmark, so the generated code is specialized to the scenario. In the standard deployment setting, VibeServe remains competitive with highly optimized systems such as vLLM. In six non-standard scenarios, involving non-standard model architectures, workload knowledge, and hardware-specific optimizations, it outperforms existing systems. The results suggest that AI agents can feasibly generate complete LLM serving stacks, trading runtime generality for generation-time specialization.

📝 Abstract

For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.
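The outer-loop and inner-loop structure described in the abstract can be illustrated with a minimal Python sketch. This is not VibeServe's code: every name here (`Candidate`, `Result`, `SearchState`, `inner_loop`, `outer_loop`, and the `propose`, `build`, `check`, and `bench` callables) is a hypothetical stand-in, and the sketch assumes a single scalar throughput metric where higher is better. In the real system, agents would presumably fill the roles of these callables.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Candidate:
    """A hypothetical serving-system design proposed by the outer loop."""
    name: str
    config: dict


@dataclass
class Result:
    """Outcome of one inner-loop run for a candidate."""
    candidate: Candidate
    correct: bool
    throughput: float  # assumed metric: higher is better


@dataclass
class SearchState:
    """Outer-loop bookkeeping: everything tried so far, plus the best result."""
    history: list = field(default_factory=list)
    best: Optional[Result] = None


def inner_loop(
    candidate: Candidate,
    build: Callable[[Candidate], Any],
    check: Callable[[Any], bool],
    bench: Callable[[Any], float],
) -> Result:
    """Implement a candidate, check correctness, then measure performance."""
    system = build(candidate)
    if not check(system):
        # Incorrect systems are recorded but never benchmarked.
        return Result(candidate, False, 0.0)
    return Result(candidate, True, bench(system))


def outer_loop(
    propose: Callable[[SearchState], Optional[Candidate]],
    build: Callable[[Candidate], Any],
    check: Callable[[Any], bool],
    bench: Callable[[Any], float],
    budget: int,
) -> SearchState:
    """Plan and track the search over designs within a fixed trial budget."""
    state = SearchState()
    for _ in range(budget):
        candidate = propose(state)  # planning step sees the full history
        if candidate is None:       # proposer has nothing left to try
            break
        result = inner_loop(candidate, build, check, bench)
        state.history.append(result)
        is_better = state.best is None or result.throughput > state.best.throughput
        if result.correct and is_better:
            state.best = result
    return state
```

The feedback path is the `SearchState` passed back into `propose`: each new design can take earlier correctness failures and measurements into account, which is one plausible reading of how the two loops interact.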

Problem

Research questions and friction points this paper is trying to address.

LLM serving

bespoke systems

system specialization

non-standard scenarios

infrastructure software

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic loop

bespoke LLM serving

generation-time specialization