Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge

📅 2026-04-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of accurately predicting inference latency under dynamic voltage and frequency scaling (DVFS) on mobile edge devices. Fluctuating CPU and GPU frequencies invalidate traditional static profiling, and exhaustive empirical profiling across frequency combinations is prohibitively expensive, particularly for small language models (SLMs) with variable context lengths. To overcome this, the authors propose FLAME, a method that introduces the first fine-grained model of asynchronous CPU-GPU execution. FLAME enables bottom-up, frequency-aware latency prediction for both deep neural networks (DNNs) and SLMs by combining three elements: layer-level latency decomposition, quantification of parallel execution overlap and pipeline bubbles, and frequency extrapolation from sparsely sampled measurements. The approach reduces profiling time from hours (DNNs) or days (SLMs) to minutes while keeping prediction error low. Applied to deadline-aware DVFS scheduling, it outperforms the state-of-the-art approach in both power efficiency and latency guarantees.
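To make "frequency extrapolation from sparsely sampled measurements" concrete, here is a minimal sketch of one common way to extrapolate latency across frequencies. It assumes a simple model, latency = a / freq + b, fitted by ordinary least squares; this model, the function names, and the sample values are illustrative assumptions and are not taken from the paper, which may use a different formulation.

```python
def fit_inverse_frequency(samples):
    """Least-squares fit of latency = a / freq + b.

    `samples` is a list of (freq, latency) pairs. The 1/freq model is an
    assumption made for illustration, not FLAME's published model.
    """
    xs = [1.0 / freq for freq, _ in samples]   # regress on 1/freq
    ys = [latency for _, latency in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                 # frequency-dependent part (work per layer)
    b = mean_y - a * mean_x       # frequency-independent overhead
    return a, b


def predict_latency(a, b, freq):
    """Extrapolate latency to a frequency that was never profiled."""
    return a / freq + b


# Hypothetical sparse measurements: (GPU frequency in MHz, layer latency in ms).
sparse_samples = [(300.0, 10.5), (600.0, 5.5), (1200.0, 3.0)]
a, b = fit_inverse_frequency(sparse_samples)
estimate_800 = predict_latency(a, b, 800.0)  # 800 MHz was not measured
```

With these made-up samples the fit recovers a = 3000 and b = 0.5, so the estimate at 800 MHz is 4.25 ms. Three measurements stand in for a full sweep here, which is the general idea behind sparse profiling, although a single-processor 1/freq curve ignores the CPU-GPU coupling that the paper targets.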

📝 Abstract
Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However, the ubiquity of Dynamic Voltage and Frequency Scaling (DVFS) renders traditional static profiling invalid in real-world deployments, as inference latency fluctuates with varying processor (CPU and GPU) frequencies. While extensive profiling across frequency combinations is theoretically possible, it is prohibitively expensive, particularly for emerging Small Language Models (SLMs), where variable context lengths stretch profiling time to days. We observe that simple analytic scaling fails to predict these fluctuations due to the complex asynchronous coupling between the CPU (kernel launching) and the GPU (kernel execution). In this paper, we introduce FLAME to accurately estimate inference latency across frequency combinations. It features a novel layer-wise model that quantifies the overlapping parallelism and then aggregates the dynamic pipeline bubbles caused by asynchronous processor interactions when extending to the full model. This bottom-up approach generalizes across diverse models from DNNs to SLMs, and its precise modeling allows profiling only a sparse subset of samples, cutting DNN profiling from hours to minutes and SLM profiling from days to minutes while maintaining small estimation errors across frequencies. We further showcase FLAME's utility in deadline-aware DVFS scheduling, where it outperforms the state-of-the-art approach in both power efficiency and latency guarantees.
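The claim that simple analytic scaling fails under asynchronous CPU-GPU coupling can be illustrated with a toy pipeline simulation. The sketch below is not FLAME's model; it assumes that the CPU launches one kernel per layer in order, that the GPU starts a kernel only once it has been launched and the previous kernel has finished, and that launch and execution times scale as 1/frequency. All names and numbers are hypothetical.

```python
def simulate_pipeline(launch_ms, exec_ms):
    """Return (total_latency, gpu_bubble) for an asynchronous launch pipeline.

    launch_ms[i]: CPU time to launch the kernel of layer i.
    exec_ms[i]:   GPU time to execute that kernel.
    A "bubble" is GPU idle time spent waiting for the CPU, including the
    wait before the very first kernel.
    """
    cpu_time = 0.0   # when the CPU finishes issuing the current kernel
    gpu_free = 0.0   # when the GPU finishes its previous kernel
    bubble = 0.0
    for launch, run in zip(launch_ms, exec_ms):
        cpu_time += launch
        start = max(cpu_time, gpu_free)   # kernel needs both conditions
        bubble += start - gpu_free        # idle GPU time before this kernel
        gpu_free = start + run
    return gpu_free, bubble


def latency_at(base_launch_ms, base_exec_ms, cpu_ratio, gpu_ratio):
    """Rescale reference timings by frequency ratios (f / f_ref), then simulate.

    Pure 1/frequency scaling of each stage is an illustrative assumption.
    """
    launch = [t / cpu_ratio for t in base_launch_ms]
    run = [t / gpu_ratio for t in base_exec_ms]
    return simulate_pipeline(launch, run)


# Hypothetical three-layer model at reference frequencies.
base_launch = [1.0, 1.0, 1.0]
base_exec = [3.0, 3.0, 3.0]

at_reference = latency_at(base_launch, base_exec, 1.0, 1.0)      # (10.0, 1.0)
at_quarter_cpu = latency_at(base_launch, base_exec, 0.25, 1.0)   # (15.0, 6.0)
```

At the reference point the GPU is the bottleneck and launches hide behind execution, giving 10 ms. Cutting the CPU frequency to a quarter makes launches four times slower, yet total latency rises only to 15 ms, with 6 ms of GPU bubbles. Neither "scale everything by the CPU ratio" (40 ms) nor "ignore the CPU" (10 ms) predicts this, which is why the overlap and the bubbles have to be modeled explicitly.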
Problem

Research questions and friction points this paper is trying to address.

latency estimation
asynchronous CPU-GPU coupling
Dynamic Voltage and Frequency Scaling
mobile edge computing
Small Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous CPU-GPU coupling
frequency-aware latency estimation
layer-wise modeling
dynamic pipeline bubbles
sparse profiling