AI Summary
Deploying large language models (LLMs) on mobile devices suffers from suboptimal energy efficiency and elevated inference latency due to uncoordinated, component-isolated DVFS controllers for the CPU, GPU, and memory. This work is the first to systematically identify and characterize the coupled performance bottlenecks among these three components during LLM inference. We propose FUSE, the first unified, energy-efficiency-aware, cross-component DVFS orchestration framework tailored for mobile LLM inference. Leveraging deep hardware-level measurements and real-world LLM workload profiling, FUSE jointly optimizes operating frequencies across CPU, GPU, and memory in real time. Experimental evaluation demonstrates that, under identical energy budgets, FUSE reduces first-token latency by 7.0%-16.9% and per-token output latency by 25.4%-36.8%, significantly improving both energy efficiency and responsiveness.
Abstract
Large Language Models (LLMs) are increasingly being integrated into applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices is challenging due to their high demand for computation, memory, and, ultimately, energy. Current mobile LLM frameworks exercise three power-hungry components (CPU, GPU, and memory) even when running LLM models primarily on the GPU, yet the DVFS governors for these components in modern mobile devices operate independently and are oblivious to one another. Motivated by this observation, we first measure the energy efficiency of a state-of-the-art LLM framework running various LLM models on mobile phones. The measurements show that the three independent mobile governors yield up to 40.4% longer prefilling and decoding latency than the optimal combination of CPU, GPU, and memory frequencies at the same energy consumption, for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes this inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor that optimizes the energy efficiency of LLM inference on mobile devices. Our evaluation on the ShareGPT dataset shows that FUSE reduces time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, at the same energy-per-token across various mobile LLM models.
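To make the cross-component idea concrete, here is a minimal, purely illustrative sketch of joint frequency selection. This is not FUSE's actual algorithm: the frequency levels and the latency/power cost models below are invented toy numbers, standing in for the kind of per-workload profiles the paper measures. The sketch picks the (CPU, GPU, memory) frequency triple that minimizes per-token latency subject to an energy-per-token budget, instead of letting three independent governors each tune their own component:

```python
# Hypothetical illustration only; frequency levels and cost models are made up.
from itertools import product

# Assumed discrete frequency levels per component (GHz).
CPU_FREQS = [1.2, 1.8, 2.4]
GPU_FREQS = [0.4, 0.6, 0.8]
MEM_FREQS = [1.6, 2.1, 3.2]

def latency_ms(f_cpu, f_gpu, f_mem):
    # Toy latency model: decoding is GPU- and memory-bound, with a CPU floor.
    return 50 / f_gpu + 30 / f_mem + 10 / f_cpu

def energy_mj(f_cpu, f_gpu, f_mem):
    # Toy power model: dynamic power grows superlinearly with frequency.
    power_w = 0.5 * f_cpu**2 + 1.0 * f_gpu**2 + 0.3 * f_mem**1.5
    return power_w * latency_ms(f_cpu, f_gpu, f_mem)  # mJ per token

def best_triple(energy_budget_mj):
    """Exhaustively search all (CPU, GPU, mem) triples within the budget."""
    feasible = [t for t in product(CPU_FREQS, GPU_FREQS, MEM_FREQS)
                if energy_mj(*t) <= energy_budget_mj]
    return min(feasible, key=lambda t: latency_ms(*t)) if feasible else None

# Under this toy model, the joint optimum trades CPU frequency down to
# afford higher GPU and memory frequencies within the same budget.
triple = best_triple(energy_budget_mj=300.0)
```

The point of the sketch is the coupling: a per-component governor that independently maxes out the CPU would blow the budget that the GPU and memory, which dominate latency here, need most. A joint search over the triple respects that trade-off by construction.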