🤖 AI Summary
This work addresses the inefficiency of existing distributed large language model (LLM) inference in resource-constrained and unstable edge environments, where strict synchronization mechanisms severely limit performance. To overcome this, the authors propose a semantic-aware neuron grouping and scheduling approach that enables loosely coupled distributed inference. By dynamically evaluating neuron importance, parallelizing the loading of critical neuron groups, and incorporating heterogeneous device-aware load balancing, the method significantly enhances both system responsiveness and robustness. Experimental results on Raspberry Pi clusters demonstrate a 3.41× end-to-end speedup for LLaMA-family models, with performance remaining close to ideal even under high packet loss rates—substantially outperforming current state-of-the-art solutions.
📝 Abstract
Deploying large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. However, it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices. Yet, existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor that assesses the significance of neuron groups prior to activation; (2) a parallel execution scheme that overlaps neuron group loading with model inference; and (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41× end-to-end speedup for LLaMA-series LLMs under unreliable network conditions. It maintains performance comparable to that under ideal network conditions and significantly outperforms the state of the art across various scenarios.
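The core scheduling idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual algorithm: it assumes each neuron group has an importance score and each device a reliability estimate, and maps the most important groups to the most reliable links so that packets lost on unstable devices affect only less critical groups. The function name, score scales, and round-robin policy are all illustrative assumptions.

```python
# Hypothetical sketch of importance-aware placement (illustrative only):
# rank neuron groups by importance, rank devices by link reliability,
# then assign critical groups to stable devices first.

def schedule(group_importance, device_reliability):
    """group_importance: {group_id: score in [0, 1]}
    device_reliability: {device_id: score in [0, 1]}
    Returns {group_id: device_id}."""
    groups = sorted(group_importance, key=group_importance.get, reverse=True)
    devices = sorted(device_reliability, key=device_reliability.get, reverse=True)
    # Round-robin over devices in reliability order: the most important
    # groups land on the most reliable links first.
    return {g: devices[i % len(devices)] for i, g in enumerate(groups)}

assignment = schedule(
    {"g0": 0.9, "g1": 0.2, "g2": 0.7, "g3": 0.1},
    {"devA": 0.99, "devB": 0.60},
)
# g0 and g1 go to devA (most reliable); g2 and g3 go to devB.
```

A real scheduler would additionally weigh per-device compute capacity (HALO's load-balancing mechanism) and update reliability estimates online, but the ranking-and-placement structure is the essential point.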