🤖 AI Summary
This work addresses the high end-to-end latency of large language model (LLM) agents on edge devices. The authors propose Agent-X, a software-only, accuracy-preserving acceleration framework for on-device LLM agents. Agent-X rewrites prompts so that prefix caching can exploit agent-specific input-token patterns, and it adds LLM-free speculative decoding for fast token generation with minimal overhead. Together, these two components target both the prefill and decoding phases. On representative agentic workloads in real systems, Agent-X achieves a 1.61× end-to-end speedup with no accuracy loss and integrates into existing on-device agents. To the authors' knowledge, this is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
📝 Abstract
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X has two key components: prompt rewriting that leverages prefix caching tailored to agent-specific input-token patterns, and LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61× end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, this is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
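To make the prefix-caching idea concrete, here is a minimal sketch of why segment order in a prompt matters for cache reuse. It is not Agent-X's rewriting algorithm, which the abstract does not describe. The `Segment` type, the `static` flag, the helper names, and the whitespace "tokenizer" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One piece of an agent prompt (hypothetical structure, for illustration)."""
    text: str
    static: bool  # True if the text is identical on every agent step


def tokenize(segments: list[Segment]) -> list[str]:
    """Flatten segments into tokens; whitespace splitting stands in for a real tokenizer."""
    return [tok for seg in segments for tok in seg.text.split()]


def rewrite_prompt(segments: list[Segment]) -> list[Segment]:
    """Move static segments to the front so consecutive prompts share a long prefix.

    sorted() is stable, so relative order within each group is preserved.
    """
    return sorted(segments, key=lambda seg: not seg.static)


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens two prompts have in common (the cacheable part)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def make_step(observation: str) -> list[Segment]:
    """Build one agent step with the changing observation placed first (assumed layout)."""
    return [
        Segment(f"Observation: {observation}", static=False),
        Segment("You are a helpful on-device agent.", static=True),
        Segment("Tools: search(query) calendar(date)", static=True),
    ]


step1 = make_step("battery at 80%")
step2 = make_step("user opened the calendar")

# Original order: the prompts diverge right after the first token.
naive_shared = shared_prefix_len(tokenize(step1), tokenize(step2))

# Rewritten order: the system prompt and tool list form a reusable prefix.
rewritten_shared = shared_prefix_len(
    tokenize(rewrite_prompt(step1)), tokenize(rewrite_prompt(step2))
)

print(naive_shared, rewritten_shared)
```

In this toy setup the original ordering shares 1 leading token across the two steps, while the rewritten ordering shares 10. A serving stack with prefix caching could, in principle, skip prefill for that shared portion.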