🤖 AI Summary
This work addresses the computational intensity of 5G NR physical-layer LDPC decoding, which struggles to coexist with other CPU tasks within the stringent 0.5 ms slot duration, often leading to latency violations. For the first time, we implement an AI-native Open RAN solution by offloading LDPC decoding from a Grace CPU to a Blackwell GB10 GPU using the high-level Sionna PHY/SYS framework, enabling a scriptable and reusable physical-layer evaluation methodology without manual CUDA optimization. Evaluated on a DGX Spark platform under 16-QAM modulation, AWGN channel conditions, and belief propagation decoding, our approach achieves an average 6× speedup, reducing per-codeword latency to 0.12–0.48 ms—occupying only 6%–24% of a slot—while increasing power consumption by merely 10–15 W, thereby substantially freeing CPU resources and enhancing system scalability.
📝 Abstract
Low-density parity-check (LDPC) decoding is one of the most computationally intensive kernels in the 5G New Radio (NR) physical layer and must complete within a 0.5\,ms transmission time interval while sharing the budget with FFT, channel estimation, demapping, HARQ, and MAC scheduling. Many open and proprietary stacks still execute LDPC on general-purpose CPUs, raising concerns about missed-slot events and limited scalability as bandwidths, modulation orders, and user multiplexing increase. This paper empirically quantifies the benefit of offloading 5G-style LDPC5G decoding from a Grace CPU to the integrated Blackwell GB10 GPU on an NVIDIA DGX~Spark platform. Using NVIDIA Sionna PHY/SYS on TensorFlow, we construct an NR-like link-level chain with an LDPC5G encoder/decoder, 16-QAM modulation, and AWGN, and sweep both the number of codewords decoded in parallel and the number of belief-propagation iterations, timing only the decoding phase while logging CPU and GPU utilization and power. Across the sweep we observe an average GPU/CPU throughput speedup of approximately $6\times$, with per-codeword CPU latency reaching $\approx 0.71$\,ms at 20 iterations (exceeding the 0.5\,ms slot), while the GB10 GPU remains within 6--24\% of the slot for the same workloads. Resource-usage measurements show that CPU-based LDPC decoding often consumes around ten Grace cores, whereas GPU-based decoding adds only $\approx10-15$\,W over GPU idle while leaving most CPU capacity available for higher-layer tasks. Because our implementation relies on high-level Sionna layers rather than hand-tuned CUDA, these results represent conservative lower bounds on achievable accelerator performance and provide a reusable, scriptable methodology for evaluating LDPC and other physical-layer kernels on future Grace/Blackwell and Aerial/ACAR/AODT platforms.