Pancake: Hierarchical Memory System for Multi-Agent LLM Serving

📅 2026-02-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high-cost approximate nearest neighbor (ANN) search challenges in large language model–based multi-agent systems, which arise from massive storage demands, frequent memory updates, and many concurrently coexisting agents. To tackle this, we propose a hierarchical agent memory system that uniquely integrates multi-level index caching, cross-agent collaborative index management, and GPU-CPU heterogeneous computing acceleration within a unified framework. The design is compatible with mainstream agent platforms such as LangChain and LlamaIndex. Experimental evaluation under realistic multi-agent workloads demonstrates that our system achieves over 4.29× higher end-to-end throughput compared to existing approaches, significantly reducing ANN retrieval overhead while enhancing overall system efficiency.

πŸ“ Abstract
In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) search problems. We present Pancake, a multi-tier agentic memory system that unifies three key techniques: (i) multi-level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU-CPU acceleration. Pancake exposes an easy-to-use interface that can be integrated into memory-based agents like MemGPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than 4.29× end-to-end throughput improvement.
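To make the multi-level caching idea from the abstract concrete, here is a minimal sketch of a two-tier agent memory: a small "hot" tier standing in for a GPU-resident index in front of a larger "cold" CPU-resident tier, with promotion on access. All names, the brute-force cosine search, and the eviction policy are hypothetical illustrations of the general technique, not Pancake's actual design.

```python
import math


class TwoTierMemory:
    """Hypothetical two-tier agent memory with promotion on access.

    The hot tier models a small, fast (e.g. GPU-resident) index;
    the cold tier models the larger CPU-resident store.
    """

    def __init__(self, hot_capacity=2):
        self.hot_capacity = hot_capacity
        self.hot = {}   # key -> embedding vector, frequently accessed
        self.cold = {}  # key -> embedding vector, everything else

    def add(self, key, vec):
        # New memories land in the cold tier until they prove hot.
        self.cold[key] = vec

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, k=1):
        # Brute-force ANN stand-in: rank all entries by cosine similarity.
        candidates = list(self.hot.items()) + list(self.cold.items())
        ranked = sorted(candidates,
                        key=lambda kv: -self._cosine(query, kv[1]))
        top = ranked[:k]
        # Promote hits into the hot tier, evicting the oldest entry
        # back to the cold tier when the hot tier is full.
        for key, vec in top:
            if key not in self.hot:
                if len(self.hot) >= self.hot_capacity:
                    evicted = next(iter(self.hot))
                    self.cold[evicted] = self.hot.pop(evicted)
                self.cold.pop(key, None)
                self.hot[key] = vec
        return [key for key, _ in top]


mem = TwoTierMemory(hot_capacity=1)
mem.add("greeting", [1.0, 0.0])
mem.add("weather", [0.0, 1.0])
print(mem.search([1.0, 0.1], k=1))  # the closest memory is promoted to hot
```

A real system would replace the brute-force scan with a proper ANN index (e.g. an HNSW or IVF structure) per tier; the sketch only shows how tiering and promotion interact with retrieval.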
Problem

Research questions and friction points this paper is trying to address.

agentic memory management
approximate nearest neighbor search
multi-agent LLM serving
memory system
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical memory system
multi-agent LLM serving
approximate nearest neighbor search
index caching
GPU-CPU acceleration
Zhengding Hu
Computer Science and Engineering, University of California, San Diego
Zaifeng Pan
University of California, San Diego
Machine Learning Systems
Prabhleen Kaur
Computer Science and Engineering, University of California, San Diego
Vibha Murthy
Computer Science and Engineering, University of California, San Diego
Zhongkai Yu
Computer Science and Engineering, University of California, San Diego
Yue Guan
University of California, San Diego
Model Compression, ML System
Zhen Wang
Postdoc at UCSD
Machine Learning, Large Language Models, Natural Language Processing
Steven Swanson
Computer Science and Engineering, University of California, San Diego
Yufei Ding
University of California, San Diego
Compiler and Computer Architecture, Machine Learning, Quantum Computing