AI Summary
Deploying large language models (LLMs) on mobile devices suffers from suboptimal energy efficiency and elevated inference latency due to uncoordinated, component-isolated DVFS controllers for the CPU, GPU, and memory. This work is the first to systematically identify and characterize the coupled performance bottlenecks among these three components during LLM inference. We propose FUSE, the first unified, energy-efficiency-aware, cross-component DVFS orchestration framework tailored for mobile LLM inference. Leveraging deep hardware-level measurements and real-world LLM workload profiling, FUSE jointly optimizes operating frequencies across CPU, GPU, and memory in real time. Experimental evaluation demonstrates that, under identical energy budgets, FUSE reduces first-token latency by 7.0%-16.9% and per-token output latency by 25.4%-36.8%, significantly improving both energy efficiency and responsiveness.
Abstract
Large Language Models (LLMs) are increasingly being integrated into applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices is challenging due to their high demand for computation, memory, and, ultimately, energy. Current mobile LLM frameworks exercise three power-hungry components (CPU, GPU, and memory) even when running LLM models primarily on the GPU, yet the DVFS governors for these components in modern mobile devices operate independently and are oblivious to one another. Motivated by this observation, we first measure the energy efficiency of a state-of-the-art LLM framework running various LLM models on mobile phones. The measurements show that the three independent mobile governors yield up to 40.4% longer prefilling and decoding latency than the optimal combination of CPU, GPU, and memory frequencies at the same energy consumption, for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes this inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor that optimizes the energy efficiency of LLM inference on mobile devices. Our evaluation on the ShareGPT dataset shows that FUSE reduces time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, at the same energy-per-token across various mobile LLM models.
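To make the cross-component idea concrete, here is a minimal, purely illustrative sketch of joint frequency selection. This is not FUSE's actual algorithm: the frequency levels and the latency/power cost models below are invented toy numbers, standing in for the kind of per-workload profiles the paper measures. The sketch picks the (CPU, GPU, memory) frequency triple that minimizes per-token latency subject to an energy-per-token budget, instead of letting three independent governors each tune their own component:

```python
# Hypothetical illustration only; frequency levels and cost models are made up.
from itertools import product

# Assumed discrete frequency levels per component (GHz).
CPU_FREQS = [1.2, 1.8, 2.4]
GPU_FREQS = [0.4, 0.6, 0.8]
MEM_FREQS = [1.6, 2.1, 3.2]

def latency_ms(f_cpu, f_gpu, f_mem):
    # Toy latency model: decoding is GPU- and memory-bound, with a CPU floor.
    return 50 / f_gpu + 30 / f_mem + 10 / f_cpu

def energy_mj(f_cpu, f_gpu, f_mem):
    # Toy power model: dynamic power grows superlinearly with frequency.
    power_w = 0.5 * f_cpu**2 + 1.0 * f_gpu**2 + 0.3 * f_mem**1.5
    return power_w * latency_ms(f_cpu, f_gpu, f_mem)  # mJ per token

def best_triple(energy_budget_mj):
    """Exhaustively search all (CPU, GPU, mem) triples within the budget."""
    feasible = [t for t in product(CPU_FREQS, GPU_FREQS, MEM_FREQS)
                if energy_mj(*t) <= energy_budget_mj]
    return min(feasible, key=lambda t: latency_ms(*t)) if feasible else None

# Under this toy model, the joint optimum trades CPU frequency down to
# afford higher GPU and memory frequencies within the same budget.
triple = best_triple(energy_budget_mj=300.0)
```

The point of the sketch is the coupling: a per-component governor that independently maxes out the CPU would blow the budget that the GPU and memory, which dominate latency here, need most. A joint search over the triple respects that trade-off by construction.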