Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large language models (LLMs) on resource-constrained edge devices, where high inference latency often impedes real-time performance. Existing speculative decoding approaches struggle to integrate compiler optimizations with heterogeneous hardware scheduling. To overcome this limitation, the paper proposes a compiler-driven heterogeneous partitioning strategy that, for the first time, combines speculative sampling with coarse-grained CPU/GPU subgraph partitioning. The authors introduce an analytical cost model that predicts the combined performance gain of this co-optimization and guides partitioning decisions while preserving programmability. Evaluated on an edge platform featuring a six-core Cortex-A CPU and a Mali GPU, the approach achieves up to 1.68× speedup on representative short-input translation tasks, with empirical results closely matching the model's predictions.
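The summary does not spell out the paper's analytical cost model, but the standard speculative-decoding speedup estimate (from the original speculative sampling literature) conveys the kind of prediction such a model makes. The sketch below is an illustration under assumed notation, not the authors' formulation: `alpha` is the average draft-token acceptance rate, `gamma` the number of draft tokens proposed per verification step, and `c` the per-step cost of the draft model relative to the target model.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup of speculative decoding over plain autoregressive
    decoding (illustrative sketch; 0 <= alpha < 1).

    alpha: average probability a draft token is accepted by the target model
    gamma: draft tokens proposed per verification pass
    c:     cost of one draft step relative to one target step
    """
    # Expected target tokens produced per cycle: the accepted draft prefix
    # plus the one token always sampled from the target model.
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # One cycle costs gamma draft steps plus a single target verification pass.
    cycle_cost = gamma * c + 1
    return expected_tokens / cycle_cost
```

A cheap draft model (small `c`) and a high acceptance rate push the estimate above 1×; a low acceptance rate can make speculation a net loss, which is exactly the regime a cost model must detect before committing to a partitioning.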

📝 Abstract
LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial and is validated on an edge device featuring a six-core Cortex-A CPU and a Mali GPU, revealing up to 1.68× speedup for translation tasks, closely matching analytic expectations.
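For readers unfamiliar with the speculative sampling the abstract builds on, the core mechanism is a probabilistic acceptance test: each draft token is kept with probability min(1, p/q), where p and q are the target and draft models' probabilities for that token. The sketch below shows only this acceptance loop (residual resampling after a rejection is omitted); all names are illustrative, not from the paper.

```python
import random

def speculative_accept(draft_probs, target_probs, draft_tokens):
    """Return the accepted prefix of draft tokens under the standard
    speculative-sampling test (illustrative sketch; resampling omitted)."""
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Keep the token with probability min(1, p/q): tokens the target
        # model likes at least as much as the draft are always kept.
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # the first rejection ends the accepted prefix
    return accepted
```

Because the target model scores all `gamma` draft tokens in a single forward pass, every accepted token amortizes one expensive sequential step, which is what makes the technique attractive on latency-bound edge hardware.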
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
LLM inference
heterogeneous edge devices
compiler-based workflow
resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Heterogeneous Edge Computing
Compiler-Assisted Optimization
Coarse-Grained Partitioning
LLM Inference Acceleration