SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud

📅 2026-02-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of co-locating large language model (LLM) inference with real-time baseband processing in a 5G standalone (SA) radio access network (RAN) under sub-second service-level agreement (SLA) constraints. The authors propose a distributed inference framework leveraging a device–RAN–cloud three-tier heterogeneous architecture. For the first time, they systematically evaluate the SLA feasibility of quantized LLMs on a real-world AI-RAN testbed and introduce Multi-Instance GPU (MIG) isolation to safeguard baseband timing guarantees under high load. Experimental results demonstrate that on-device inference fails to meet sub-second latency requirements, while deploying quantized models at the RAN edge consistently achieves response times under 0.5 seconds. Cloud-based inference meets the 1.0-second SLA with 100% success rate, and MIG effectively preserves baseband processing timing even under 20 concurrent LLM requests and saturated network traffic.
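
As a minimal sketch of the SLA framing above (assuming nothing beyond the 0.5 s and 1.0 s deadlines it names), the snippet below computes per-tier SLA attainment as the fraction of requests finishing within a deadline. The latency samples, tier labels, and function name are illustrative placeholders, not the paper's measurements.

```python
from typing import Dict, List

def sla_attainment(latencies_s: List[float], deadline_s: float) -> float:
    """Fraction of requests that complete within the deadline."""
    if not latencies_s:
        return 0.0
    return sum(1 for t in latencies_s if t <= deadline_s) / len(latencies_s)

# Illustrative per-tier latency samples in seconds (placeholders,
# not the paper's data).
measurements: Dict[str, List[float]] = {
    "on-device": [2.8, 3.1, 2.5],    # multi-second: misses sub-second budgets
    "ran-edge": [0.31, 0.42, 0.38],  # quantized models concentrate below 0.5 s
    "cloud": [0.62, 0.71, 0.88],     # meets 1.0 s; 0.5 s is hard over the WAN
}

for tier, samples in measurements.items():
    for deadline in (0.5, 1.0):
        pct = 100.0 * sla_attainment(samples, deadline)
        print(f"{tier:>9}: {pct:5.1f}% of requests within {deadline} s")
```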

๐Ÿ“ Abstract
Embodied AI requires sub-second inference near the Radio Access Network (RAN), but deployments span heterogeneous tiers (on-device, RAN-edge, cloud) and must not disrupt real-time baseband processing. We report measurements from a 5G Standalone (SA) AI-RAN testbed using a fixed baseline policy for repeatability. The setup includes an on-device tier, a three-node RAN-edge cluster co-hosting a containerized 5G RAN, and a cloud tier. We find that on-device execution remains multi-second and fails to meet sub-second budgets. At the RAN edge, SLA feasibility is primarily determined by model variant choice: quantized models concentrate below 0.5 s, while unquantized and some larger quantized models incur deadline misses due to stalls and queuing. In the cloud tier, meeting a 0.5 s deadline is challenging on the measured WAN path (up to 32.9% of requests complete within 0.5 s), but all evaluated variants meet a 1.0 s deadline (100% within 1.0 s). Under saturated downlink traffic and up to N = 20 concurrent inference clients, Multi-Instance GPU (MIG) isolation preserves baseband timing-health proxies, supporting safe co-location under fixed partitioning.
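
The co-location result rests on hardware-level MIG partitioning. As a minimal sketch of how such isolation is commonly wired up (not the paper's actual tooling), the snippet below pins an inference worker to a single MIG instance via the standard CUDA_VISIBLE_DEVICES mechanism; the MIG UUID, script name, and model flag are hypothetical placeholders.

```python
import os
import subprocess
from typing import List

def launch_worker(mig_uuid: str, cmd: List[str]) -> subprocess.Popen:
    """Start an inference worker confined to one MIG instance.

    CUDA exposes only the device named in CUDA_VISIBLE_DEVICES, so the
    worker cannot contend for compute or memory outside its partition,
    leaving the other partitions free for real-time baseband processing.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid  # accepts MIG-<UUID> identifiers
    return subprocess.Popen(cmd, env=env)

# Placeholder MIG UUID (list real ones with `nvidia-smi -L`) and a
# hypothetical serving command; both illustrative, not from the paper.
worker = launch_worker(
    "MIG-00000000-0000-0000-0000-000000000000",
    ["python", "serve_llm.py", "--model", "quantized-llm"],
)
```
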
Problem

Research questions and friction points this paper is trying to address.

SLA-aware inference
distributed LLM inference
RAN-edge computing
real-time baseband processing
heterogeneous deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLA-aware inference
AI-RAN co-location
quantized LLM
Multi-Instance GPU (MIG)
distributed LLM inference
Hariz Yet
Singapore University of Technology and Design (SUTD)
Nguyen Thanh Tam
Singapore University of Technology and Design (SUTD)
Mao V. Ngo
Singapore University of Technology and Design (SUTD)
O-RAN, AI-RAN, Machine Learning, Edge Computing, IoT
Lim Yi Shen
Singapore University of Technology and Design (SUTD)
Lin Wei
Singapore University of Technology and Design (SUTD)
Jihong Park
Associate Professor, SUTD, SMIEEE
Wireless Communications, Semantic Communication, Distributed Machine Learning, AI-RAN
Binbin Chen
Singapore University of Technology and Design (SUTD)
Networked systems, Cyber-physical systems, Distributed systems, Wireless networking
Tony Q. S. Quek
Singapore University of Technology and Design (SUTD)