Pooling Engram Conditional Memory in Large Language Models using CXL

📅 2026-03-10
🤖 AI Summary
This work addresses the high storage cost and low-latency sparse-access challenges posed by engram-based conditional memory in large language models, which stem from the massive scale of its embedding tables. To overcome these limitations, the study proposes the first use of Compute Express Link (CXL) memory pools for engram storage, offloading the embedding tables from main memory to cost-effective, low-latency CXL-attached devices. The approach is integrated with the SGLang inference framework to enable efficient memory access. Compared with RDMA-based solutions, the proposed method supports finer-grained, lower-latency memory operations, achieving end-to-end inference performance comparable to DRAM while significantly improving storage scalability and cost efficiency.

📝 Abstract
Engram conditional memory has emerged as a promising component for LLMs by decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well-suited for offloading to lower-tier memory. In this paper, we propose using Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides fine-grained and low-latency access required by minimal and discrete retrieval patterns of Engram. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.
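The access pattern the abstract relies on (a few discrete embedding rows fetched per step, rather than bulk transfers) can be illustrated with a minimal sketch. This is not the paper's implementation: here a file-backed `np.memmap` stands in for the CXL-attached memory pool, and the names `engram_lookup`, `VOCAB`, and `DIM` are hypothetical. Real CXL.mem access would be ordinary loads/stores to a mapped region, which is exactly why such minimal, discrete retrievals suit it better than message-granular RDMA.

```python
import os
import tempfile

import numpy as np

# Hypothetical parameters for an Engram-style embedding table.
VOCAB, DIM = 10_000, 64

# A file-backed memmap stands in for a lower-tier (e.g. CXL-pooled)
# memory region holding the table outside of hot DRAM.
path = os.path.join(tempfile.mkdtemp(), "engram_table.bin")
table = np.memmap(path, dtype=np.float32, mode="w+", shape=(VOCAB, DIM))
table[:] = np.random.default_rng(0).standard_normal((VOCAB, DIM))
table.flush()

def engram_lookup(ids):
    """Sparse, fine-grained retrieval: gather only the requested rows.

    Each lookup touches a handful of discrete DIM-sized rows, so access
    granularity is small -- the pattern that favors byte-addressable
    CXL memory over page- or message-granular remote access.
    """
    pooled = np.memmap(path, dtype=np.float32, mode="r", shape=(VOCAB, DIM))
    return np.asarray(pooled[np.asarray(ids)])

vecs = engram_lookup([3, 42, 9_999])
print(vecs.shape)  # (3, 64)
```

Because the rows are static knowledge (no writes during inference), they can also be prefetched ahead of the decode step, hiding the extra latency of the lower tier.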
Problem

Research questions and friction points this paper is trying to address.

Engram conditional memory
Large Language Models
CXL memory pool
memory offloading
low-latency access
Innovation

Methods, ideas, or system contributions that make the work stand out.

Engram conditional memory
CXL memory pooling
large language models
offloading embedding tables
low-latency retrieval
Authors

Ruiyang Ma (School of Computer Science, Peking University)
Teng Ma (Alibaba Cloud)
Zhiyuan Su (Shandong Yingxin Computer Technology Co., Ltd)
Hantian Zha (Renmin University of China)
Xinpeng Zhao (Alibaba Cloud)
Xuchun Shang (Alibaba Cloud)
Xingrui Yi (Alibaba Cloud)
Zheng Liu (Alibaba Cloud)
Zhu Cao (Tsinghua University)
An Wu (Shandong Yingxin Computer Technology Co., Ltd)
Zhichong Dou (Shandong Yingxin Computer Technology Co., Ltd)
Ziqian Liu (The University of Hong Kong)
Daikang Kuang (Peking University)
Guojie Luo (Peking University)