AI Summary
This work addresses the challenges of deploying large language model-driven multi-step agent applications, which include component heterogeneity, dynamic control flow, state persistence, and unpredictable latency. To tackle these issues, the authors propose a ground-up agent-serving framework that decouples workflow definition from execution. The design features lightweight stubs preserving full Python expressiveness, a managed state layer separating logical and physical state, and a two-tier adaptive control architecture combining global policies with local event-driven decisions. The framework supports dependency- and context-aware futures along with adaptive routing and scheduling. Experimental evaluation across three agent workloads demonstrates significant improvements: tail latency is reduced by 34%–74%, peak throughput increases by up to 2.9×, the system sustains 80 requests per second, scales to 130,000 concurrent futures, and maintains control overhead below 500 milliseconds.
Abstract
LLM-driven agentic applications increasingly automate complex, multi-step tasks, but serving them efficiently remains challenging due to heterogeneous components, dynamic and model-driven control flow, long-running state, and unpredictable latencies. Nalar is a ground-up agent-serving framework that cleanly separates workflow specification from execution while providing the runtime visibility and control needed for robust performance. Nalar preserves full Python expressiveness, using lightweight auto-generated stubs that turn agent and tool invocations into futures carrying dependency and context metadata. A managed state layer decouples logical state from physical placement, enabling safe reuse, migration, and consistent retry behavior. A two-level control architecture combines global policy computation with local event-driven enforcement to support adaptive routing, scheduling, and resource management across evolving workflows. Together, these mechanisms allow Nalar to deliver scalable, efficient, and policy-driven serving of heterogeneous agentic applications without burdening developers with orchestration logic. Across three agentic workloads, Nalar cuts tail latency by 34–74%, achieves up to 2.9× speedups, sustains 80 RPS where baselines fail, and scales to 130K futures with sub-500 ms control overhead.
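To make the stub-and-future idea concrete, here is a minimal, hypothetical Python sketch of how auto-generated stubs might turn ordinary function calls into futures that carry dependency and context metadata. None of these names (`stub`, `MetaFuture`) come from Nalar; they are illustrative assumptions about one way such an interface could look, not the framework's actual API.

```python
import concurrent.futures as cf
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical sketch of stub-generated futures, NOT Nalar's real API:
# a decorator wraps a plain Python function so that calling it returns a
# future annotated with its upstream dependencies and request context.

_pool = cf.ThreadPoolExecutor(max_workers=4)

@dataclass
class MetaFuture:
    future: cf.Future
    deps: list = field(default_factory=list)      # upstream MetaFutures
    context: dict = field(default_factory=dict)   # e.g. session / request ids

    def result(self) -> Any:
        return self.future.result()

def stub(fn: Callable) -> Callable:
    """Wrap `fn` so invocations return a MetaFuture instead of a value."""
    def wrapper(*args, context=None, **kwargs):
        # Any MetaFuture arguments become recorded dependencies; a real
        # scheduler could use this graph for routing and placement, here
        # we simply resolve them before the wrapped call runs.
        deps = [a for a in args if isinstance(a, MetaFuture)]
        def run():
            resolved = [a.result() if isinstance(a, MetaFuture) else a
                        for a in args]
            return fn(*resolved, **kwargs)
        return MetaFuture(_pool.submit(run), deps=deps, context=context or {})
    return wrapper

@stub
def search(query: str) -> str:
    return f"results for {query}"

@stub
def summarize(text: str) -> str:
    return f"summary of {text}"

f1 = search("agent serving", context={"request": 1})
f2 = summarize(f1, context={"request": 1})  # dependency captured automatically
print(f2.deps == [f1])  # True: f2 records f1 as an upstream dependency
print(f2.result())      # summary of results for agent serving
```

The point of the sketch is that the developer writes ordinary-looking Python while every invocation yields a metadata-rich future, which is what would give a runtime the visibility needed for adaptive routing and scheduling.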