Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses performance bottlenecks in agent-based reasoning systems, where tool invocation drives up first-token latency, depresses KV cache hit rates, and leaves intra-request parallelism unexploited. To overcome these limitations, the authors propose a co-designed architecture that integrates the agent orchestrator with the LLM inference engine, removing their traditional black-box separation and enabling cross-layer optimization through three key mechanisms: tool-aware prompt splitting, streaming tool execution, and semantic-aware cache management. Implemented atop vLLM and evaluated on A100 GPUs, the system reduces median first-token latency by 15% and end-to-end latency by 10%, demonstrating a systematic mitigation of latency in agent-augmented LLM inference.
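
Of the three mechanisms, streaming tool execution is the most self-contained to illustrate. The sketch below is not the paper's implementation; it only shows the general idea of watching the decode stream for a completed tool-call and dispatching the tool immediately, under the simplifying assumption that calls arrive as balanced-brace JSON objects. All names (`run_tool`, `stream_tools`, `fake_decoder`) are hypothetical stand-ins.

```python
import asyncio
import json
from typing import AsyncIterator

async def run_tool(call: dict) -> str:
    # Stand-in for a real tool (search, code execution, ...).
    await asyncio.sleep(0.1)
    return f"<result of {call['name']}>"

async def stream_tools(decode_stream: AsyncIterator[str]) -> list[str]:
    """Dispatch each tool as soon as its call JSON is complete in the
    decode stream, instead of waiting for the model's full output."""
    buffer, depth, in_flight = "", 0, []
    async for token in decode_stream:
        buffer += token
        depth += token.count("{") - token.count("}")
        if depth == 0 and buffer.lstrip().startswith("{"):
            # Braces are balanced: one tool call's arguments are complete.
            call = json.loads(buffer)
            in_flight.append(asyncio.create_task(run_tool(call)))
            buffer = ""
    # Tool latency now overlaps with decoding of the remaining calls.
    return await asyncio.gather(*in_flight)

async def fake_decoder() -> AsyncIterator[str]:
    # Pretend the model emits two tool calls a few tokens at a time.
    for token in ['{"name": "web_search", ', '"args": {"q": "FTR latency"}}',
                  '{"name": "calculator", ', '"args": {"expr": "2+2"}}']:
        await asyncio.sleep(0.05)  # decode-step delay
        yield token

print(asyncio.run(stream_tools(fake_decoder())))
```

In a decoupled design, the orchestrator only sees the model's output after the final token; co-design lets the first tool start while the model is still decoding the second call.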

📝 Abstract
Agentic applications use LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased First Token Rendered (FTR) latency for the final answer. Through analysis of synthetic requests at production scale, we reveal three critical challenges: tool calls account for 30-80% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and serialized execution of LLM calls and tools wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing cross-layer optimizations. We present SUTRADHARA, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API enabling three optimizations: tool-aware prompt splitting, which overlaps tool execution with the subsequent LLM prefill; streaming tool execution, which dispatches tools incrementally during decode rather than waiting for the complete output; and orchestrator-aware cache management, which uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, SUTRADHARA reduces median FTR latency by 15% and end-to-end latency by 10% across workloads on A100 GPUs, demonstrating that co-design can systematically tame latency in agentic systems.
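
As a rough illustration of the first optimization, the sketch below overlaps a simulated tool call with prefill of the part of the next prompt that is already known, which is the essence of tool-aware prompt splitting. All names (`prefill`, `run_tool`, `next_iteration`) and timings are hypothetical stand-ins, not SUTRADHARA's API.

```python
import asyncio

async def prefill(segment: str) -> None:
    # Stand-in for engine prefill: populate the KV cache for one prompt segment.
    await asyncio.sleep(0.001 * len(segment))

async def run_tool(name: str) -> str:
    # Stand-in for an external tool call (API request, code execution, ...).
    await asyncio.sleep(0.2)
    return f"<{name} output>"

async def next_iteration(history: str, tool_name: str) -> str:
    """Split the next prompt into (known prefix, tool result) so that
    prefilling the prefix runs concurrently with the tool call."""
    tool_task = asyncio.create_task(run_tool(tool_name))
    await prefill(history)        # overlapped with the in-flight tool
    tool_result = await tool_task
    await prefill(tool_result)    # only this short suffix waits on the tool
    return history + tool_result

asyncio.run(next_iteration("...conversation and prior tool results...", "web_search"))
```

In a decoupled design the same iteration would serialize the tool call and the full-prompt prefill; the co-design changes only which component knows the prompt's structure.
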
Problem

Research questions and friction points this paper is trying to address.

agentic inference
tool-based agents
latency bottleneck
First Token Rendered
orchestrator-engine co-design
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic inference
orchestrator-engine co-design
tool-aware prompt splitting
streaming tool execution
orchestrator-aware caching
🔎 Similar Papers
No similar papers found.