AI Summary
This work addresses the challenges of deploying large language model-driven multi-step agent applications, which include component heterogeneity, dynamic control flow, state persistence, and unpredictable latency. To tackle these issues, the authors propose a ground-up agent-serving framework that decouples workflow definition from execution. The design features lightweight stubs preserving full Python expressiveness, a managed state layer separating logical and physical state, and a two-tier adaptive control architecture combining global policies with local event-driven decisions. The framework supports dependency- and context-aware futures along with adaptive routing and scheduling. Experimental evaluation across three agent workloads demonstrates significant improvements: tail latency is reduced by 34%–74%, peak throughput increases by up to 2.9×, the system sustains 80 requests per second, scales to 130,000 concurrent futures, and maintains control overhead below 500 milliseconds.
Abstract
LLM-driven agentic applications increasingly automate complex, multi-step tasks, but serving them efficiently remains challenging due to heterogeneous components, dynamic and model-driven control flow, long-running state, and unpredictable latencies. Nalar is a ground-up agent-serving framework that cleanly separates workflow specification from execution while providing the runtime visibility and control needed for robust performance. Nalar preserves full Python expressiveness, using lightweight auto-generated stubs that turn agent and tool invocations into futures carrying dependency and context metadata. A managed state layer decouples logical state from physical placement, enabling safe reuse, migration, and consistent retry behavior. A two-level control architecture combines global policy computation with local event-driven enforcement to support adaptive routing, scheduling, and resource management across evolving workflows. Together, these mechanisms allow Nalar to deliver scalable, efficient, and policy-driven serving of heterogeneous agentic applications without burdening developers with orchestration logic. Across three agentic workloads, Nalar cuts tail latency by 34–74%, achieves up to 2.9× speedups, sustains 80 RPS where baselines fail, and scales to 130K futures with sub-500 ms control overhead.
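To make the stub-and-future idea concrete, here is a minimal, hypothetical Python sketch of how auto-generated stubs might turn ordinary function calls into futures that carry dependency and context metadata. None of these names (`stub`, `MetaFuture`) come from Nalar; they are illustrative assumptions about one way such an interface could look, not the framework's actual API.

```python
import concurrent.futures as cf
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical sketch of stub-generated futures, NOT Nalar's real API:
# a decorator wraps a plain Python function so that calling it returns a
# future annotated with its upstream dependencies and request context.

_pool = cf.ThreadPoolExecutor(max_workers=4)

@dataclass
class MetaFuture:
    future: cf.Future
    deps: list = field(default_factory=list)      # upstream MetaFutures
    context: dict = field(default_factory=dict)   # e.g. session / request ids

    def result(self) -> Any:
        return self.future.result()

def stub(fn: Callable) -> Callable:
    """Wrap `fn` so invocations return a MetaFuture instead of a value."""
    def wrapper(*args, context=None, **kwargs):
        # Any MetaFuture arguments become recorded dependencies; a real
        # scheduler could use this graph for routing and placement, here
        # we simply resolve them before the wrapped call runs.
        deps = [a for a in args if isinstance(a, MetaFuture)]
        def run():
            resolved = [a.result() if isinstance(a, MetaFuture) else a
                        for a in args]
            return fn(*resolved, **kwargs)
        return MetaFuture(_pool.submit(run), deps=deps, context=context or {})
    return wrapper

@stub
def search(query: str) -> str:
    return f"results for {query}"

@stub
def summarize(text: str) -> str:
    return f"summary of {text}"

f1 = search("agent serving", context={"request": 1})
f2 = summarize(f1, context={"request": 1})  # dependency captured automatically
print(f2.deps == [f1])  # True: f2 records f1 as an upstream dependency
print(f2.result())      # summary of results for agent serving
```

The point of the sketch is that the developer writes ordinary-looking Python while every invocation yields a metadata-rich future, which is what would give a runtime the visibility needed for adaptive routing and scheduling.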