OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

📅 2026-05-27
🤖 AI Summary
This work addresses the NIC-side memory overhead and latency that datacenter RDMA inherits from InfiniBand's Queue Pair abstraction. The authors present the first clean-room, open-source implementation of Huawei's Unified Bus (UB) protocol, covering both the transport and transaction layers. UB decouples per-application endpoint state from per-host transport state, makes ordering opt-in, and reaches remote memory through native CPU load/store semantics. The project provides reproducible evaluation platforms at three tiers (synthesisable RTL, a SystemC simulator, and a gem5 scaffold), each paired with a matched RoCEv2 RC baseline for a controlled comparison. For 64-byte remote reads, UB achieves an end-to-end latency of approximately 500 ns, 4.37× lower than the RoCEv2 baseline (2186 ns), with 2.80× higher throughput, while occupying only about 14% of the LUTs on an Alveo U50 FPGA.
📝 Abstract
Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair, amounting to hundreds of megabytes at 1024-application fanout, and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both costs follow from the Queue-Pair-over-PCIe abstraction that RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so that connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers (synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold), each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, the harness, and a controlled comparison that closed silicon does not admit. On the canonical 64-byte remote fetch (LOAD per UB-spec Sec. 8.3, READ on RoCEv2 RC), UB's load/store path delivers ~500 ns end-to-end, 4.37× lower than the matched baseline (2186 ns), sustains 2.80× higher throughput, and fits in ~14% of a U50's LUTs.
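The scaling argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It contrasts a Queue Pair model, where context is held per (application, remote-endpoint) pair, with a decoupled model, where per-application and per-host state are added rather than multiplied. The function names, the 1024 × 1024 fanout, and the 256-byte context size are illustrative assumptions, not figures from the paper; only the 2186 ns baseline and the 4.37× ratio come from the abstract.

```python
def qp_context_bytes(apps: int, remotes: int, bytes_per_qp: int) -> int:
    """Queue Pair model: one context per (application, remote-endpoint) pair."""
    return apps * remotes * bytes_per_qp


def decoupled_context_bytes(apps: int, remotes: int,
                            bytes_per_endpoint: int,
                            bytes_per_transport: int) -> int:
    """Decoupled model: per-application endpoint state plus per-host transport state."""
    return apps * bytes_per_endpoint + remotes * bytes_per_transport


# Hypothetical workload and context sizes (assumptions, not from the paper).
APPS = 1024
REMOTES = 1024
BYTES_PER_CONTEXT = 256

qp_total = qp_context_bytes(APPS, REMOTES, BYTES_PER_CONTEXT)
ub_total = decoupled_context_bytes(APPS, REMOTES,
                                   BYTES_PER_CONTEXT, BYTES_PER_CONTEXT)

print(f"QP model:        {qp_total / 2**20:.0f} MiB")   # 256 MiB
print(f"Decoupled model: {ub_total / 2**10:.0f} KiB")   # 512 KiB

# Consistency check on the reported latency figures (values from the abstract).
BASELINE_NS = 2186   # matched RoCEv2 RC baseline
RATIO = 4.37         # reported latency ratio
implied_ub_ns = BASELINE_NS / RATIO
print(f"Implied UB latency: {implied_ub_ns:.1f} ns")    # 500.2 ns
```

Under these assumed sizes the multiplicative model lands in the hundreds of megabytes while the additive one stays well under a megabyte, which is the shape of the claim rather than a reproduction of the paper's measurements. The final check shows that 2186 ns divided by 4.37 is consistent with the stated ~500 ns.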
Problem

Research questions and friction points this paper is trying to address.

RDMA
network interface bottleneck
Queue Pair
latency
throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Bus
OpenURMA
RDMA
clean-room implementation
low-latency interconnect