Building Interactive Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

📅 2026-05-13

🤖 AI Summary
This work addresses the high latency of large language models in multi-turn, tool-augmented interactions, which stems from synchronous waiting and prevents the sub-second responsiveness that applications such as voice assistants require. The authors propose an asynchronous I/O architecture that decouples the agent's reason-and-act thread from waiting on the user or environment, and introduce speculative tool calling to handle uncertainty about whether the user's input is complete. For small edge-scale models, they present a clock-based training methodology for streaming inputs and asynchronous responses, together with a synthetic data generation strategy for SFT. The approach yields 1.3–1.7× speedups with minor accuracy loss on existing real-time cloud APIs, and 1.6–2.2× speedups with Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct across multiple tool calling benchmarks.
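To illustrate why decoupling inference from tool calls can reduce latency, here is a minimal sketch using Python's `asyncio`. It is not the authors' implementation: `slow_tool`, `run_sequential`, `run_overlapped`, and `compare` are hypothetical names, and `asyncio.sleep` stands in for both tool latency and model work.

```python
import asyncio
import time


async def slow_tool(name: str, delay: float) -> str:
    """Hypothetical external tool; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_sequential(delay: float) -> float:
    """Synchronous-style agent: wait for the tool, then do more work."""
    start = time.perf_counter()
    await slow_tool("lookup", delay)
    await asyncio.sleep(delay)  # stands in for further model processing
    return time.perf_counter() - start


async def run_overlapped(delay: float) -> float:
    """Asynchronous-style agent: start the tool, keep working, then join."""
    start = time.perf_counter()
    tool_task = asyncio.create_task(slow_tool("lookup", delay))
    await asyncio.sleep(delay)  # model processing overlaps the tool call
    await tool_task
    return time.perf_counter() - start


def compare(delay: float = 0.05) -> tuple[float, float]:
    """Return (sequential_seconds, overlapped_seconds) for one tool call."""
    sequential = asyncio.run(run_sequential(delay))
    overlapped = asyncio.run(run_overlapped(delay))
    return sequential, overlapped
```

In this toy setup the sequential path takes roughly twice the delay while the overlapped path takes roughly one delay. Real speedups depend on how much useful work the agent can do while waiting.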
📝 Abstract
There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real-time latency-sensitive applications. In our work, we aim to enable real-time interaction even for agents with complex multi-turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason-and-act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7$\times$ speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6-2.2$\times$ speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.
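The idea of speculative tool calling can be sketched as follows. This is an illustrative assumption about one way to realize it, not the paper's algorithm: the agent launches a tool call on the input heard so far, and if more user input arrives, it discards the in-flight call and relaunches with the updated input. `call_tool` and `speculative_agent` are hypothetical names.

```python
import asyncio


async def call_tool(query: str) -> str:
    """Hypothetical tool; the sleep stands in for external latency."""
    await asyncio.sleep(0.01)
    return f"result for '{query}'"


async def speculative_agent(chunks: list[str]) -> tuple[str, int]:
    """Process streamed user input, speculating after every chunk.

    Returns the final tool result and the number of speculative
    calls that were discarded because more input arrived.
    """
    if not chunks:
        raise ValueError("expected at least one input chunk")

    heard = ""
    pending = None
    discarded = 0
    for chunk in chunks:
        heard = f"{heard} {chunk}".strip()
        if pending is not None:
            # New input invalidates the earlier guess: drop that call.
            pending.cancel()
            discarded += 1
        # Speculate that the input is complete and start the tool now.
        pending = asyncio.create_task(call_tool(heard))
        await asyncio.sleep(0)  # yield so the task can start running

    # No further input arrived, so the last speculation is used.
    return await pending, discarded
```

For example, `asyncio.run(speculative_agent(["weather in Paris", "tomorrow"]))` discards the first call and returns the result for the full query. When the user stops after the first chunk, the speculative call is already in flight by the time the end of turn is confirmed, which is where the latency saving comes from.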
Problem

Research questions and friction points this paper is trying to address.

real-time interaction
low-latency
agentic AI
tool calling
asynchronous I/O
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous I/O
Speculative Tool Calling
Real-Time Agents
Streaming Input Handling
Edge-Scale LLMs