🤖 AI Summary
Emerging LLM system patterns—including disaggregated inference, MoE routing, and asynchronous reinforcement learning fine-tuning—rely on flexible point-to-point communication, yet mainstream RDMA implementations are tightly coupled to vendor-specific NICs, hindering portability and integration into inference engines. This work proposes TransferEngine, a portable RDMA communication framework that introduces a unified abstraction layer and two primitives—WriteImm and ImmCounter—to provide completion notification without assuming in-order delivery from the transport. TransferEngine transparently manages multiple NICs per GPU across heterogeneous hardware (e.g., NVIDIA ConnectX-7, AWS EFA), reaching a peak throughput of 400 Gbps on both. Evaluation demonstrates: (1) efficient KvCache transfer for disaggregated inference with dynamic scaling; (2) trillion-parameter RL weight updates completed in just 1.3 seconds; and (3) MoE dispatch/combine with decode latency below DeepEP's on ConnectX-7.
📝 Abstract
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement learning fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of the network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.
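The WriteImm/ImmCounter pairing can be illustrated with a small, self-contained Python sketch. The class and function names below are hypothetical, not the actual TransferEngine API: the idea is that each one-sided write carries an immediate value, and the receiver tracks a counter keyed by that immediate, so completion detection works even when the transport (e.g., EFA's unordered SRD) delivers writes out of order.

```python
import random
from collections import Counter

class ImmCounter:
    """Counts received immediates; completion needs no transport ordering."""
    def __init__(self):
        self._counts = Counter()

    def deliver(self, imm: int) -> None:
        # Called when a write-with-immediate lands; arrival order is irrelevant.
        self._counts[imm] += 1

    def done(self, imm: int, expected: int) -> bool:
        # Transfer is complete once all expected chunks have arrived.
        return self._counts[imm] >= expected

def write_imm(dst_buf: bytearray, offset: int, payload: bytes,
              imm: int, counter: ImmCounter) -> None:
    """One-sided write: place bytes at a remote offset, then bump the counter."""
    dst_buf[offset:offset + len(payload)] = payload
    counter.deliver(imm)

# A 4 KiB transfer split into four 1 KiB chunks, arriving out of order.
dst = bytearray(4096)
ctr = ImmCounter()
chunks = [(i * 1024, bytes([i]) * 1024) for i in range(4)]
random.shuffle(chunks)  # simulate an unordered transport
for off, data in chunks:
    write_imm(dst, off, data, imm=7, counter=ctr)
assert ctr.done(7, expected=4)  # all chunks landed, independent of order
```

Because each write targets a disjoint offset and only the count of immediates signals completion, the receiver never needs the NIC to preserve ordering—this is what lets the same abstraction span ConnectX-7 (reliable, ordered RC) and EFA (unordered SRD).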