Nalar: An agent serving framework

πŸ“… 2026-01-08
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenges of serving large language model–driven, multi-step agent applications: component heterogeneity, dynamic and model-driven control flow, long-running state, and unpredictable latency. To tackle these issues, the authors propose Nalar, a ground-up agent-serving framework that decouples workflow definition from execution. The design features lightweight auto-generated stubs that preserve full Python expressiveness, a managed state layer that separates logical state from physical placement, and a two-level adaptive control architecture combining global policy computation with local event-driven decisions. The framework supports dependency- and context-aware futures along with adaptive routing and scheduling. Across three agent workloads, evaluation shows significant improvements: tail latency is reduced by 34%–74%, peak throughput increases by up to 2.9×, the system sustains 80 requests per second where baselines fail, scales to 130,000 concurrent futures, and keeps control overhead below 500 milliseconds.
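The summary above describes stubs that turn agent and tool invocations into futures carrying dependency and context metadata. A minimal sketch of that pattern is shown below; the names (`AgentFuture`, `stub`, the `context` keys) are illustrative assumptions, not Nalar's actual API.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class AgentFuture(Future):
    """A future that carries dependency and context metadata with its result.
    Hypothetical sketch, not Nalar's real class."""
    def __init__(self, deps=(), context=None):
        super().__init__()
        self.deps = list(deps)        # upstream futures this call depends on
        self.context = context or {}  # routing/scheduling hints for the runtime

_pool = ThreadPoolExecutor(max_workers=4)

def stub(fn):
    """Wrap a tool/agent function so calls return metadata-carrying futures."""
    def wrapper(*args, context=None, **kwargs):
        deps = [a for a in args if isinstance(a, AgentFuture)]
        fut = AgentFuture(deps=deps, context=context)
        def run():
            # Resolve any future-valued arguments before invoking the function,
            # so dependencies are honored implicitly.
            resolved = [a.result() if isinstance(a, AgentFuture) else a
                        for a in args]
            try:
                fut.set_result(fn(*resolved, **kwargs))
            except Exception as exc:
                fut.set_exception(exc)
        _pool.submit(run)
        return fut
    return wrapper

@stub
def summarize(text):
    return text.upper()

@stub
def combine(a, b):
    return f"{a} | {b}"

f1 = summarize("plan step", context={"model": "small"})
f2 = summarize("tool output")
f3 = combine(f1, f2)   # dependency on f1 and f2 is recorded automatically
print(f3.result())     # PLAN STEP | TOOL OUTPUT
print(len(f3.deps))    # 2
```

Note how ordinary Python call syntax is preserved: the developer writes plain function calls, and the wrapper harvests the dependency graph from the arguments.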

πŸ“ Abstract
LLM-driven agentic applications increasingly automate complex, multi-step tasks, but serving them efficiently remains challenging due to heterogeneous components, dynamic and model-driven control flow, long-running state, and unpredictable latencies. Nalar is a ground-up agent-serving framework that cleanly separates workflow specification from execution while providing the runtime visibility and control needed for robust performance. Nalar preserves full Python expressiveness, using lightweight auto-generated stubs that turn agent and tool invocations into futures carrying dependency and context metadata. A managed state layer decouples logical state from physical placement, enabling safe reuse, migration, and consistent retry behavior. A two-level control architecture combines global policy computation with local event-driven enforcement to support adaptive routing, scheduling, and resource management across evolving workflows. Together, these mechanisms allow Nalar to deliver scalable, efficient, and policy-driven serving of heterogeneous agentic applications without burdening developers with orchestration logic. Across three agentic workloads, Nalar cuts tail latency by 34–74%, achieves up to 2.9× speedups, sustains 80 RPS where baselines fail, and scales to 130K futures with sub-500 ms control overhead.
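The two-level control architecture in the abstract pairs slow, global policy computation with fast, local event-driven enforcement. The sketch below illustrates that split under assumed names (`GlobalController`, `LocalScheduler`, the backend labels); it is not Nalar's actual control plane.

```python
class GlobalController:
    """Recomputes a global policy (here: per-backend routing weights)
    on a slow cadence, e.g. from periodic load reports."""
    def __init__(self):
        self.policy = {"fast-backend": 0.8, "slow-backend": 0.2}

    def recompute(self, load_report):
        # Shift weight away from overloaded backends: weight is the
        # complement of each backend's share of total reported load.
        total = sum(load_report.values()) or 1
        self.policy = {b: 1 - load / total for b, load in load_report.items()}

class LocalScheduler:
    """Enforces the latest global policy per request event, without
    calling back into the global controller on the hot path."""
    def __init__(self, controller):
        self.controller = controller

    def route(self, request):
        policy = self.controller.policy   # read the latest policy snapshot
        return max(policy, key=policy.get)

gc = GlobalController()
sched = LocalScheduler(gc)
print(sched.route({"id": 1}))   # fast-backend (initial policy)

# Global tier observes load and recomputes; local tier picks it up
# on the next event, with no per-request coordination cost.
gc.recompute({"fast-backend": 9, "slow-backend": 1})
print(sched.route({"id": 2}))   # slow-backend
```

The design point this illustrates: per-request decisions stay local and cheap, which is what keeps control overhead low even as the number of in-flight futures grows.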
Problem

Research questions and friction points this paper is trying to address.

agent serving
LLM-driven agentic applications
heterogeneous components
dynamic control flow
unpredictable latencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent-serving framework
LLM-driven agents
managed state layer
two-level control architecture
future-based execution