TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale

📅 2025-12-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In disaggregated LLM inference architectures, KV tensor transfer between the prefill and decode stages induces severe network contention, increasing time-to-first-token (TTFT), limiting throughput, and rendering prefix reuse ineffective. This paper proposes TraCT, a rack-scale unified KV store built on CXL 3.0 shared memory, employed both as a low-overhead KV-transport substrate and as a prefix-aware shared cache. A two-tier cross-node synchronization protocol addresses data-consistency and chunk-management challenges on non-coherent CXL memory. The system integrates GPU Direct load/store, DMA acceleration, and the Dynamo framework. Evaluated against an RDMA+DRAM baseline, TraCT achieves up to a 9.8× reduction in average TTFT, a 6.2× improvement in P99 latency, and a 1.6× increase in peak throughput.

๐Ÿ“ Abstract
Disaggregated LLM serving improves resource efficiency by separating the compute-intensive prefill phase from the latency-critical decode phase. However, this architecture introduces a fundamental bottleneck: key/value (KV) tensors generated during prefill must be transferred to decode workers, and existing systems rely on RDMA-based network paths for this exchange. As model sizes and context lengths grow, KV transfer dominates both time-to-first-token (TTFT) and peak throughput, and remains highly sensitive to network contention even when prefix reuse is high. This paper presents TraCT, a rack-scale LLM serving system that uses CXL shared memory as both a KV-transfer substrate and a rack-wide prefix-aware KV cache. TraCT enables GPUs to write and read KV blocks directly through CXL load/store and DMA operations, eliminating the NIC hop that constrains existing disaggregated pipelines. Realizing this design, however, requires addressing new challenges in synchronization, consistency, and data management on non-coherent CXL memory. TraCT tackles these with software mechanisms, including a two-tier inter-node synchronization protocol. We implement TraCT on the Dynamo LLM inference framework and show that, across static and synthetic workloads, TraCT reduces average TTFT by up to 9.8×, lowers P99 latency by up to 6.2×, and improves peak throughput by up to 1.6× compared to RDMA- and DRAM-based caching baselines.
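The abstract stresses that consistency on non-coherent CXL memory must be managed in software: a producer's writes are not automatically visible to readers on other nodes. As a rough illustration only (not TraCT's actual protocol or block layout; the structure, names, and seqlock-style versioning below are assumptions), a prefill node might publish a KV block into the shared segment and a decode node might read it back safely like this:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical KV-block descriptor in a CXL shared-memory segment. */
#define KV_BLOCK_BYTES 64

typedef struct {
    _Atomic uint64_t seq;               /* even = stable, odd = write in progress */
    uint8_t payload[KV_BLOCK_BYTES];
} kv_block_t;

/* On real non-coherent CXL memory the producer must also flush written
   cache lines (e.g. clflushopt + sfence) before publishing; modeled here
   as a portability no-op. */
static inline void cxl_flush(const void *p, size_t n) { (void)p; (void)n; }

/* Producer (prefill node): mark the block busy, write, flush, mark stable. */
void kv_publish(kv_block_t *b, const uint8_t *src, size_t n) {
    uint64_t s = atomic_load_explicit(&b->seq, memory_order_relaxed);
    atomic_store_explicit(&b->seq, s + 1, memory_order_release);  /* odd: busy */
    memcpy(b->payload, src, n);
    cxl_flush(b->payload, n);
    atomic_store_explicit(&b->seq, s + 2, memory_order_release);  /* even: stable */
}

/* Consumer (decode node): seqlock-style read; retry if a write raced. */
int kv_read(kv_block_t *b, uint8_t *dst, size_t n) {
    for (;;) {
        uint64_t s0 = atomic_load_explicit(&b->seq, memory_order_acquire);
        if (s0 & 1)
            continue;                   /* writer in progress, retry */
        memcpy(dst, b->payload, n);
        uint64_t s1 = atomic_load_explicit(&b->seq, memory_order_acquire);
        if (s0 == s1)
            return 0;                   /* snapshot was consistent */
    }
}
```

The key point this sketch captures is that, without hardware coherence, ordering and visibility are the software's responsibility: the version word is the only publication signal, and payload flushes must complete before it flips back to even.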
Problem

Research questions and friction points this paper is trying to address.

KV tensor transfer between prefill and decode workers is the dominant bottleneck in disaggregated LLM serving.
RDMA-based network paths for KV exchange are highly sensitive to contention, inflating TTFT and capping throughput.
Non-coherent CXL shared memory raises synchronization, consistency, and data-management challenges.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses CXL shared memory as both a KV-transfer substrate and a rack-wide prefix-aware KV cache
Enables direct GPU access via CXL load/store operations
Implements two-tier synchronization for non-coherent memory
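The two-tier synchronization mechanism is only named in this summary, not specified. One plausible shape, sketched purely as an assumption, is a hierarchical lock: an intra-node tier resolves contention among local threads cheaply in host DRAM, so at most one waiter per node ever touches the cross-node lock word that lives in the CXL shared segment.

```c
#include <stdatomic.h>

/* Hypothetical two-tier lock; all names are illustrative assumptions. */
typedef struct {
    _Atomic int  local_gate;  /* tier 1: gate in node-local DRAM */
    _Atomic int *cxl_word;    /* tier 2: lock word in the CXL shared segment */
} two_tier_lock_t;

void tt_lock(two_tier_lock_t *l) {
    /* Tier 1: only one thread per node proceeds, so local losers spin
       on node-local memory instead of hammering the CXL fabric. */
    while (atomic_exchange_explicit(&l->local_gate, 1, memory_order_acquire))
        ;
    /* Tier 2: rack-wide mutual exclusion via the shared CXL word;
       only the tier-1 winner generates cross-node traffic. */
    while (atomic_exchange_explicit(l->cxl_word, 1, memory_order_acquire))
        ;
}

void tt_unlock(two_tier_lock_t *l) {
    atomic_store_explicit(l->cxl_word, 0, memory_order_release);
    atomic_store_explicit(&l->local_gate, 0, memory_order_release);
}
```

The design rationale behind such a hierarchy is to keep hot spinning off the shared fabric: intra-node contention stays in cheap local memory, and the expensive non-coherent CXL word sees at most one contender per host.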
Dongha Yoon, Virginia Tech
Younghoon Min, SK Hynix America
Hoshik Kim, SK Hynix America
Sam H. Noh, Virginia Tech (Professor of Computer Science; Systems Software, Operating Systems, File Systems, Storage Systems)
Jongryool Kim, SK Hynix America