Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the CXL memory bandwidth bottleneck that limits LLM inference performance, this paper proposes CXL-NDP, a transparent near-data processing architecture. Without modifying the CXL.mem protocol or AI model implementations, CXL-NDP integrates a precision-scalable bit-plane layout and transparent lossless compression directly within the CXL device, enabling dynamic quantization and in-situ processing of model weights and KV caches. This alleviates the bandwidth constraint while preserving numerical fidelity: CXL-NDP achieves a 43% end-to-end inference throughput improvement, extends the maximum context length by 87%, and reduces the KV cache memory footprint by 46.9%, all with zero precision loss. The design incurs only modest hardware overhead, making it practical to deploy and scale across diverse CXL-based accelerator systems.

📝 Abstract
Large language model (LLM) inference is bottlenecked by the limited bandwidth of CXL-based memory used for capacity expansion. We introduce CXL-NDP, a transparent near-data processing architecture that amplifies effective CXL bandwidth without requiring changes to the CXL.mem interface or AI models. CXL-NDP integrates a precision-scalable bit-plane layout for dynamic quantization with transparent lossless compression of weights and KV caches directly within the CXL device. In end-to-end serving, CXL-NDP improves throughput by 43%, extends the maximum context length by 87%, and reduces the KV cache footprint by 46.9% without accuracy loss. Hardware synthesis confirms its practicality with a modest silicon footprint, lowering the barrier for adopting efficient, scalable CXL-based memory in generative AI infrastructure.
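The precision-scalable bit-plane layout described in the abstract can be illustrated with a small sketch. The idea is to store each weight tensor as separate bit planes, most significant first, so the device can serve a reduced-precision read by streaming only the leading planes, and a full-precision read losslessly by streaming all of them. This is an illustrative model only; the int8 format and the function names here are assumptions, not the paper's hardware implementation:

```python
import numpy as np

def to_bitplanes(w_int8: np.ndarray) -> np.ndarray:
    """Decompose int8 weights into 8 bit planes, MSB first."""
    u = w_int8.astype(np.uint8)                         # two's-complement byte view
    planes = [(u >> b) & 1 for b in range(7, -1, -1)]   # bit 7 (MSB) down to bit 0
    return np.stack(planes)                             # shape: (8, *w.shape)

def from_bitplanes(planes: np.ndarray, k: int = 8) -> np.ndarray:
    """Reconstruct weights from the k most significant planes.

    k=8 is lossless; k<8 drops low-order bits, i.e. a coarser
    dynamic quantization served from fewer streamed planes.
    """
    u = np.zeros(planes.shape[1:], dtype=np.uint8)
    for i in range(k):
        u |= planes[i] << (7 - i)
    return u.astype(np.int8)

w = np.array([-100, -1, 0, 37, 127], dtype=np.int8)
assert np.array_equal(from_bitplanes(to_bitplanes(w), k=8), w)  # lossless at full precision
```

Reading with `k=4`, for example, transfers half the bits of each weight while keeping the high-order information, which is how a bit-plane layout lets one stored copy serve multiple precisions.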
Problem

Research questions and friction points this paper is trying to address.

Alleviating the CXL memory bandwidth bottleneck that limits LLM inference
Enabling transparent near-data processing without interface modifications
Reducing KV cache footprint while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transparent near-data processing architecture
Precision-scalable bit-plane layout for dynamic quantization
Lossless compression of weights and KV caches
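The lossless-compression contribution can be pictured with a general-purpose software codec standing in for the device-side compressor (an assumption for illustration; the paper's in-device compression scheme is not detailed here). The essential property is a byte-exact round trip over weight and KV-cache blocks:

```python
import zlib
import numpy as np

# Toy KV-cache block of fp16 keys; small-magnitude values share exponent
# bytes, which is the kind of redundancy a lossless codec exploits.
rng = np.random.default_rng(0)
kv = (rng.standard_normal((64, 128)) * 0.02).astype(np.float16)

raw = kv.tobytes()
packed = zlib.compress(raw, level=6)   # stand-in for the in-device compressor
restored = np.frombuffer(zlib.decompress(packed), dtype=np.float16).reshape(kv.shape)

assert np.array_equal(restored, kv)    # byte-exact, so zero precision loss
```

Because decompression is exact, the host sees unmodified data through the unchanged CXL.mem interface; only the bytes moved over the link shrink.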
Rui Xie, Rensselaer Polytechnic Institute, Troy, NY, USA
Asad Ul Haq, Graduate Student, Computer Systems Engineering, RPI
Linsen Ma, Rensselaer Polytechnic Institute
Yunhua Fang, Graduate Student, Rensselaer Polytechnic Institute (LLM inference, memory architecture)
Zirak Burzin Engineer, Wiseburn Da Vinci Science, El Segundo, CA, USA
Liu Liu, Rensselaer Polytechnic Institute, Troy, NY, USA
Tong Zhang, Rensselaer Polytechnic Institute, Troy, NY, USA