What Limits Agentic Systems Efficiency?

📅 2025-10-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work on agentic systems predominantly focuses on reasoning performance, overlooking end-to-end efficiency bottlenecks. Through systematic empirical analysis, this paper decomposes end-to-end latency into two components, LLM API latency and web environment latency, and finds that web environment latency alone can account for up to 53.7% of total response time. To address this, the authors propose SpecCache, a lightweight optimization framework that combines caching with speculative execution: it predicts the agent's likely next web interactions and proactively prefetches or caches their results, thereby reducing redundant request overhead. Evaluated across 15 mainstream LLMs and 5 commercial API providers on two standard benchmarks, SpecCache improves the cache hit rate by up to 58x over a random caching strategy, reduces web environment overhead by up to 3.2x, and significantly lowers end-to-end latency, all without degrading task accuracy.

📝 Abstract
Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated strong reasoning capabilities. To further enhance LLM capabilities, recent agentic systems, such as Deep Research, incorporate web interactions into LLM reasoning to mitigate uncertainties and reduce potential errors. However, existing research predominantly focuses on reasoning performance, often neglecting the efficiency of agentic systems. In this work, we present a comprehensive empirical study that identifies efficiency bottlenecks in web-interactive agentic systems. We decompose end-to-end latency into two primary components: LLM API latency and web environment latency. We conduct a comprehensive empirical study across 15 models and 5 providers to demonstrate high variability in API-based agentic systems. We observe that web environment latency can contribute as much as 53.7% to the overall latency in a web-based agentic system. To improve latency, we propose SpecCache, a caching framework augmented with speculative execution that can reduce web environment overhead. Extensive evaluations on two standard benchmarks show that our approach improves the cache hit rate by up to 58x compared to a random caching strategy, while reducing web environment overhead by up to 3.2x, without degrading agentic system performance.
Problem

Research questions and friction points this paper is trying to address.

Identifies efficiency bottlenecks in web-interactive agentic systems
Decomposes end-to-end latency into LLM API and web environment components
Proposes a caching framework to reduce web environment overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes SpecCache caching framework with speculative execution
Reduces web environment latency by up to 3.2 times
Improves cache hit rate by up to 58 times over random caching
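The core idea, caching augmented with speculative execution, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `predict` callable stands in for whatever cheap predictor speculates the agent's next web actions, and `fetch` stands in for a slow web request.

```python
from typing import Callable, Iterable

class SpecCache:
    """Illustrative cache with speculative prefetching: while the main
    model reasons, a cheap predictor guesses the agent's next web
    actions so their results can be fetched into the cache ahead of
    time. Names and structure are assumptions for this sketch."""

    def __init__(self, fetch: Callable[[str], str],
                 predict: Callable[[str], Iterable[str]]):
        self.fetch = fetch            # real (slow) web request
        self.predict = predict        # cheap guess at the next actions
        self.cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, context: str) -> None:
        # Speculatively execute predicted actions before they are needed.
        for url in self.predict(context):
            if url not in self.cache:
                self.cache[url] = self.fetch(url)

    def get(self, url: str) -> str:
        # On the agent's actual request, serve from cache when the
        # speculation was correct; fall back to a live fetch otherwise.
        if url in self.cache:
            self.hits += 1
            return self.cache[url]
        self.misses += 1
        result = self.fetch(url)
        self.cache[url] = result
        return result
```

When a speculated action matches the agent's actual request, the slow web round trip is off the critical path, which is the source of the reported latency reduction; a mispredicted action only costs wasted prefetch work, never correctness, since `get` falls back to a live fetch.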