Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency

πŸ“… 2025-07-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Deploying large language models (LLMs) on mobile devices suffers from suboptimal energy efficiency and elevated inference latency due to uncoordinated, component-isolated DVFS controllers for CPU, GPU, and memory. This work is the first to systematically identify and characterize the coupled performance bottlenecks among these three components during LLM inference. We propose FUSEβ€”the first unified, energy-efficiency-aware, cross-component DVFS orchestration framework tailored for mobile LLM inference. Leveraging deep hardware-level measurements and real-world LLM workload profiling, FUSE jointly optimizes operating frequencies across CPU, GPU, and memory in real time. Experimental evaluation demonstrates that, under identical energy budgets, FUSE reduces first-token latency by 7.0%–16.9% and per-token output latency by 25.4%–36.8%, significantly improving both energy efficiency and responsiveness.

πŸ“ Abstract
Large Language Models (LLMs) are increasingly being integrated into applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices is challenging due to their high demand for computation, memory, and, ultimately, energy. While current mobile LLM frameworks exercise three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, the DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious to each other. Motivated by this observation, we first measure the energy efficiency of a state-of-the-art LLM framework running various LLM models on mobile phones, which shows that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies at the same energy consumption, for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes this inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation on the ShareGPT dataset shows that FUSE reduces time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, at the same energy-per-token across various mobile LLM models.
Problem

Research questions and friction points this paper is trying to address.

Optimizing DVFS governors for mobile LLM energy efficiency
Reducing latency in LLM inference on resource-limited devices
Improving coordination among CPU, GPU, and memory governors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified energy-aware governor for LLM inference
Optimizes CPU, GPU, memory frequencies jointly
Reduces latency while maintaining energy efficiency
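The paper's governor design is not reproduced here, but the core idea behind the Innovation bullets above, choosing CPU, GPU, and memory frequencies jointly rather than in isolation, can be sketched as a toy brute-force search. All frequency levels and the latency/power/energy models below are invented for illustration and are not from the paper; real devices expose discrete OPP tables per component.

```python
from itertools import product

# Hypothetical frequency levels in GHz (not from the paper).
CPU_FREQS = [1.0, 1.8, 2.8]
GPU_FREQS = [0.4, 0.7, 0.9]
MEM_FREQS = [1.6, 2.1, 3.2]

def token_latency_ms(cpu, gpu, mem):
    """Toy latency model: decode is memory- and GPU-bound, with a
    small CPU share for scheduling/dequantization."""
    return 20.0 / mem + 8.0 / gpu + 3.0 / cpu

def power_w(cpu, gpu, mem):
    """Toy power model: dynamic power grows superlinearly with frequency."""
    return 0.8 * cpu**2 + 1.2 * gpu**2 + 0.5 * mem**1.5

def energy_per_token_mj(cpu, gpu, mem):
    # Energy (mJ) = power (W) x latency (ms).
    return power_w(cpu, gpu, mem) * token_latency_ms(cpu, gpu, mem)

def best_triple(energy_budget_mj):
    """Return the (cpu, gpu, mem) triple with the lowest per-token
    latency whose energy-per-token fits the budget, or None."""
    feasible = [
        (token_latency_ms(c, g, m), (c, g, m))
        for c, g, m in product(CPU_FREQS, GPU_FREQS, MEM_FREQS)
        if energy_per_token_mj(c, g, m) <= energy_budget_mj
    ]
    return min(feasible)[1] if feasible else None

print(best_triple(energy_budget_mj=75.0))
```

An isolated per-component governor would tune each frequency against its own utilization signal; the joint search above instead trades a slower CPU for a faster memory bus when that combination yields lower latency at the same energy-per-token, which is the coordination gap the paper targets.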
Zongpu Zhang
Shanghai Jiao Tong University, Shanghai, China and Purdue University, West Lafayette, USA
Pranab Dash
Purdue University, West Lafayette, USA
Y. Charlie Hu
Michael and Katherine Birck Professor of Electrical and Computer Engineering, Purdue University
Edge AI Systems, Smartphone Energy Management, Computer Networks, Distributed Systems
Qiang Xu
Purdue University, West Lafayette, USA
Jian Li
Shanghai Jiao Tong University, Shanghai, China
Haibing Guan
Shanghai Jiao Tong University, Shanghai, China