🤖 AI Summary
This work addresses the high storage cost and the need for low-latency sparse access posed by engram-based conditional memory in large language models, both of which stem from the massive scale of its embedding tables. To overcome these limitations, the study proposes the first use of Compute Express Link (CXL) memory pools for Engram storage, offloading the embedding tables from main memory to cost-effective CXL-attached devices. The approach is integrated into the SGLang inference framework to enable efficient memory access. Compared to RDMA-based solutions, CXL supports finer-grained, lower-latency memory operations, achieving end-to-end inference performance comparable to DRAM while significantly improving storage scalability and cost efficiency.
📝 Abstract
Engram conditional memory has emerged as a promising component for LLMs, decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well suited to offloading to a lower memory tier. In this paper, we propose using a Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides the fine-grained, low-latency access required by Engram's small, discrete retrieval patterns. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.
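The offloading idea in the abstract can be sketched in miniature: a large embedding table lives in a slow tier (standing in for a CXL memory pool), a small DRAM-resident cache holds hot rows, and predicted row ids are prefetched so that the sparse lookups at inference time hit DRAM. This is a hypothetical illustration, not the paper's implementation; the class name `TieredEngramStore`, the FIFO eviction policy, and all parameters are assumptions.

```python
import random

class TieredEngramStore:
    """Toy two-tier embedding store (hypothetical sketch, not the paper's code).

    `cxl_table` stands in for the CXL-resident embedding table; `cache`
    stands in for a small DRAM cache. Because Engram-style access is
    sparse (a handful of rows per step) and predictable, prefetching the
    predicted rows hides the lower tier's extra latency.
    """

    def __init__(self, num_rows, dim, cache_rows=64, seed=0):
        rng = random.Random(seed)
        # Stand-in for the massive table held in the CXL memory pool.
        self.cxl_table = [[rng.random() for _ in range(dim)]
                          for _ in range(num_rows)]
        self.cache = {}              # row id -> vector (DRAM stand-in)
        self.cache_rows = cache_rows
        self.hits = 0
        self.misses = 0

    def _fill(self, r):
        # Simple FIFO eviction; a real cache would use a smarter policy.
        if len(self.cache) >= self.cache_rows:
            self.cache.pop(next(iter(self.cache)))
        self.cache[r] = self.cxl_table[r]

    def prefetch(self, row_ids):
        # Pull predicted rows into the DRAM cache ahead of use.
        for r in row_ids:
            if r not in self.cache:
                self._fill(r)

    def gather(self, row_ids):
        # Sparse lookup at inference time; counts DRAM hits vs. slow-tier misses.
        out = []
        for r in row_ids:
            if r in self.cache:
                self.hits += 1
            else:
                self.misses += 1
                self._fill(r)
            out.append(self.cache[r])
        return out

store = TieredEngramStore(num_rows=10_000, dim=8)
store.prefetch([3, 42, 777])       # predicted sparse accesses for the next step
vecs = store.gather([3, 42, 777])  # served entirely from the DRAM cache
print(len(vecs), len(vecs[0]), store.hits, store.misses)  # 3 8 3 0
```

The point of the sketch is the access pattern, not the data structures: because each step touches only a few known rows, a prefetch issued one step ahead converts every slow-tier access into a cache hit, which is why fine-grained, low-latency CXL loads can match DRAM end to end.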