Compute Can't Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure

📅 2025-07-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address memory bandwidth bottlenecks, high inter-GPU communication overhead, and rigid resource allocation in GPU-centric architectures scaling large language models (LLMs) and retrieval-augmented generation (RAG), this paper proposes a CXL-based modular datacenter architecture. Our method integrates CXL, HBM, silicon photonics, and UALink/NVLink, featuring: (1) a novel CXL-over-XLink hybrid interconnect enabling low-latency, high-bandwidth cross-node memory access; and (2) a hierarchical, globally coherent disaggregated memory model that significantly reduces long-distance data migration costs. Evaluated against conventional GPU-centric designs, our architecture achieves 35–60% improvements in throughput, resource utilization, and horizontal scalability. It efficiently supports thousand-GPU-scale AI training and inference workloads while mitigating memory and interconnect bottlenecks inherent in monolithic GPU clusters.

Technology Category

Application Category

📝 Abstract
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects-collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion)-and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
Problem

Research questions and friction points this paper is trying to address.

Addressing memory and communication bottlenecks in modern AI workloads
Proposing scalable modular data center architecture using CXL
Optimizing interconnects and memory hierarchy for AI infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular data center architecture with CXL
Hybrid CXL-over-XLink interconnect design
Hierarchical memory model combining local and pooled memory
🔎 Similar Papers
No similar papers found.