AI Summary
This work addresses the high-cost approximate nearest neighbor (ANN) search challenges in large language model-based multi-agent systems, which arise from massive storage demands, frequent memory updates, and concurrent agent coexistence. To tackle this, we propose a hierarchical agent memory system that integrates multi-level index caching, cross-agent collaborative index management, and GPU-CPU heterogeneous computing acceleration within a unified framework. The design is compatible with mainstream agent platforms such as LangChain and LlamaIndex. Experimental evaluation under realistic multi-agent workloads demonstrates that our system achieves over 4.29× higher end-to-end throughput than existing approaches, significantly reducing ANN retrieval overhead while improving overall system efficiency.
Abstract
In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) search problems. We present Pancake, a multi-tier agentic memory system that unifies three key techniques: (i) multi-level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU-CPU acceleration. Pancake exposes an easy-to-use interface that can be integrated into memory-based agents such as Mem-GPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than a 4.29× end-to-end throughput improvement.
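To make the multi-tier memory idea concrete, the sketch below shows a toy two-tier agent memory: recent vectors live in a small "hot" tier, older ones are demoted to a "cold" tier, and a query searches both and merges the top-k by cosine similarity. All names (`TieredMemory`, `hot_capacity`) are illustrative assumptions for exposition, not Pancake's actual API, and the exhaustive scan stands in for the real ANN index.

```python
# Illustrative two-tier agent memory sketch; not Pancake's real interface.
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class TieredMemory:
    """Hypothetical hot/cold tiered vector store for a single agent."""

    def __init__(self, hot_capacity=4):
        self.hot = []    # recently inserted (fast tier)
        self.cold = []   # demoted older entries (slow tier)
        self.hot_capacity = hot_capacity

    def add(self, vec, payload):
        # New memories enter the hot tier; the oldest is demoted on overflow.
        self.hot.append((vec, payload))
        if len(self.hot) > self.hot_capacity:
            self.cold.append(self.hot.pop(0))

    def search(self, query, k=2):
        # A real system would query a cached ANN index per tier;
        # here we scan both tiers exhaustively and merge top-k.
        candidates = self.hot + self.cold
        scored = [(cosine(query, v), p) for v, p in candidates]
        return heapq.nlargest(k, scored, key=lambda t: t[0])


mem = TieredMemory(hot_capacity=2)
mem.add([1.0, 0.0], "x-axis")
mem.add([0.0, 1.0], "y-axis")   # pushes "x-axis" into the cold tier
mem.add([1.0, 1.0], "diag")
top = mem.search([1.0, 0.1], k=1)
print(top[0][1])  # the cold-tier entry "x-axis" is still found
```

The point of the sketch is that demotion changes *where* a memory lives, not *whether* it is retrievable; the paper's multi-level index caching plays the analogous role of keeping hot index shards cheap to query while cold ones remain reachable.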