🤖 AI Summary
To address the prohibitively high inference cost of LLM-based agents in complex workflows—caused by frequent, redundant planning—the paper introduces agentic plan caching. This mechanism automatically extracts structured plan templates at test time and enables environment-aware template retrieval and context-adaptive reuse via keyword matching and lightweight model adaptation, thereby overcoming traditional semantic caching's reliance on static dialogue contexts. As a plan-caching approach designed specifically for LLM agents, it supports plug-and-play deployment without modifying the underlying model or inference pipeline. Evaluated across multiple real-world agentic applications, agentic plan caching reduces average inference cost by 46.62% while maintaining task performance, and remains fully compatible with existing LLM serving infrastructure.
📝 Abstract
LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from the planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates into task-specific plans using fresh context. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
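To make the cache lifecycle concrete, here is a minimal sketch of the store/retrieve flow the abstract describes: templates extracted from completed executions are keyed by extracted keywords, and new requests are matched against them before falling back to full planning. All names (`PlanCache`, `extract_keywords`, the overlap threshold) are illustrative assumptions, not the paper's actual implementation; in particular, the paper's keyword extraction and template adaptation use models, whereas this toy uses word overlap.

```python
# Illustrative sketch of plan caching for agents (not the paper's code).
from dataclasses import dataclass, field

def extract_keywords(text: str) -> frozenset:
    """Toy keyword extractor: lowercase words longer than 3 characters.
    The paper uses a learned extractor; this stands in for it."""
    return frozenset(w for w in text.lower().split() if len(w) > 3)

@dataclass
class PlanCache:
    # Maps keyword sets from completed executions to plan templates.
    templates: dict = field(default_factory=dict)

    def store(self, task: str, template: str) -> None:
        """Cache a plan template extracted from a completed execution."""
        self.templates[extract_keywords(task)] = template

    def lookup(self, task: str, threshold: float = 0.5):
        """Return the best-overlapping cached template, or None on a miss.
        On a hit, a lightweight model would adapt the template to the
        new task's context (adaptation step not shown here)."""
        keys = extract_keywords(task)
        best, best_score = None, 0.0
        for cached_keys, template in self.templates.items():
            overlap = len(keys & cached_keys) / max(len(keys | cached_keys), 1)
            if overlap >= threshold and overlap > best_score:
                best, best_score = template, overlap
        return best

cache = PlanCache()
cache.store("book a flight from Boston to Tokyo",
            "1. search flights {origin}->{dest}\n2. compare fares\n3. book")
hit = cache.lookup("book a cheap flight from Boston to Osaka")   # cache hit
miss = cache.lookup("summarize this quarterly report")           # cache miss
```

A cache hit skips the expensive planning call entirely; only the cheap adaptation step runs, which is where the reported cost savings come from.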