Compute Can't Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure

📅 2025-07-09

📈 Citations: 0

✨ Influential: 0

career value

247K/year

🤖 AI Summary

To address memory bandwidth bottlenecks, high inter-GPU communication overhead, and rigid resource allocation in GPU-centric architectures scaling large language models (LLMs) and retrieval-augmented generation (RAG), this paper proposes a CXL-based modular datacenter architecture. Our method integrates CXL, HBM, silicon photonics, and UALink/NVLink, featuring: (1) a novel CXL-over-XLink hybrid interconnect enabling low-latency, high-bandwidth cross-node memory access; and (2) a hierarchical, globally coherent disaggregated memory model that significantly reduces long-distance data migration costs. Evaluated against conventional GPU-centric designs, our architecture achieves 35–60% improvements in throughput, resource utilization, and horizontal scalability. It efficiently supports thousand-GPU-scale AI training and inference workloads while mitigating memory and interconnect bottlenecks inherent in monolithic GPU clusters.

Technology Category

Application Category

📝 Abstract

Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects-collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion)-and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.

Problem

Research questions and friction points this paper is trying to address.

Addressing memory and communication bottlenecks in modern AI workloads

Proposing scalable modular data center architecture using CXL

Optimizing interconnects and memory hierarchy for AI infrastructure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular data center architecture with CXL

Hybrid CXL-over-XLink interconnect design

Hierarchical memory model combining local and pooled memory

🔎 Similar Papers

The rising costs of training frontier AI models