🤖 AI Summary
This work addresses the high storage cost and the need for low-latency sparse access posed by engram-based conditional memory in large language models, both of which stem from the massive scale of its embedding tables. To overcome these limitations, the study proposes the first use of Compute Express Link (CXL) memory pools for Engram storage, offloading the embedding tables from main memory to cost-effective CXL-attached devices. The approach is integrated into the SGLang inference framework to enable efficient memory access. Compared to RDMA-based solutions, CXL supports finer-grained, lower-latency memory operations, achieving end-to-end inference performance comparable to DRAM while significantly improving storage scalability and cost efficiency.
📝 Abstract
Engram conditional memory has emerged as a promising component for LLMs, decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well suited to offloading to a lower memory tier. In this paper, we propose using a Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides the fine-grained, low-latency access required by Engram's small, discrete retrieval patterns. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.
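The offloading idea in the abstract can be sketched in miniature: a large embedding table lives in a slow tier (standing in for a CXL memory pool), a small DRAM-resident cache holds hot rows, and predicted row ids are prefetched so that the sparse lookups at inference time hit DRAM. This is a hypothetical illustration, not the paper's implementation; the class name `TieredEngramStore`, the FIFO eviction policy, and all parameters are assumptions.

```python
import random

class TieredEngramStore:
    """Toy two-tier embedding store (hypothetical sketch, not the paper's code).

    `cxl_table` stands in for the CXL-resident embedding table; `cache`
    stands in for a small DRAM cache. Because Engram-style access is
    sparse (a handful of rows per step) and predictable, prefetching the
    predicted rows hides the lower tier's extra latency.
    """

    def __init__(self, num_rows, dim, cache_rows=64, seed=0):
        rng = random.Random(seed)
        # Stand-in for the massive table held in the CXL memory pool.
        self.cxl_table = [[rng.random() for _ in range(dim)]
                          for _ in range(num_rows)]
        self.cache = {}              # row id -> vector (DRAM stand-in)
        self.cache_rows = cache_rows
        self.hits = 0
        self.misses = 0

    def _fill(self, r):
        # Simple FIFO eviction; a real cache would use a smarter policy.
        if len(self.cache) >= self.cache_rows:
            self.cache.pop(next(iter(self.cache)))
        self.cache[r] = self.cxl_table[r]

    def prefetch(self, row_ids):
        # Pull predicted rows into the DRAM cache ahead of use.
        for r in row_ids:
            if r not in self.cache:
                self._fill(r)

    def gather(self, row_ids):
        # Sparse lookup at inference time; counts DRAM hits vs. slow-tier misses.
        out = []
        for r in row_ids:
            if r in self.cache:
                self.hits += 1
            else:
                self.misses += 1
                self._fill(r)
            out.append(self.cache[r])
        return out

store = TieredEngramStore(num_rows=10_000, dim=8)
store.prefetch([3, 42, 777])       # predicted sparse accesses for the next step
vecs = store.gather([3, 42, 777])  # served entirely from the DRAM cache
print(len(vecs), len(vecs[0]), store.hits, store.misses)  # 3 8 3 0
```

The point of the sketch is the access pattern, not the data structures: because each step touches only a few known rows, a prefetch issued one step ahead converts every slow-tier access into a cache hit, which is why fine-grained, low-latency CXL loads can match DRAM end to end.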