PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms

📅 2024-10-05
🏛️ arXiv.org
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the need for lightweight and efficient inference of large language models (LLMs) on mobile devices, this paper introduces the first automated benchmarking framework that jointly evaluates resource efficiency and safety risks. Methodologically, it integrates weight and activation quantization configurations, supports cross-platform performance profiling across ARM CPU/GPU/NPU backends, and incorporates real-time power-consumption monitoring alongside automated hallucination and toxicity detection. Contributions include: (1) a mobile-specific, multi-dimensional evaluation paradigm that unifies assessment of generation quality, latency, throughput, memory footprint, power draw, and harmful output; (2) empirical insights into the nonlinear trade-offs between mobile chip energy efficiency and quantization strategy, and their impact on latency and memory; and (3) quantitative evidence of systematic accuracy and safety degradation under model compression, establishing a reproducible, scalable evaluation standard for on-device LLM deployment.
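To make the evaluation paradigm concrete, the sketch below enumerates the kind of configuration sweep such a framework would need to cover (models × quantization schemes × backends, each scored on the listed metric dimensions). The model names, quantization labels such as "W4A16", and backend identifiers are illustrative assumptions for this sketch, not PalmBench's actual configuration API.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative only: names below are assumptions, not PalmBench's real config format.

@dataclass
class BenchConfig:
    model: str       # model identifier
    quant: str       # weight/activation quantization scheme, e.g. "W4A16"
    backend: str     # target execution backend on the device
    metrics: tuple   # dimensions the framework is described as measuring

MODELS = ("llama-2-7b", "gemma-2b")
QUANTS = ("W8A8", "W4A16", "W3A16")
BACKENDS = ("arm-cpu", "arm-gpu", "npu")
METRICS = ("quality", "latency", "throughput", "memory", "power", "harmful_output")

def build_sweep():
    """Enumerate the cross-product of models, quantization schemes, and backends."""
    return [BenchConfig(m, q, b, METRICS) for m, q, b in product(MODELS, QUANTS, BACKENDS)]

if __name__ == "__main__":
    for cfg in build_sweep():
        print(f"{cfg.model:>10} | {cfg.quant:<5} | {cfg.backend:<7} -> {', '.join(cfg.metrics)}")
```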

📝 Abstract
Deploying large language models (LLMs) locally on mobile devices is advantageous in scenarios where transmitting data to remote cloud servers is either undesirable due to privacy concerns or impractical due to limited network connectivity. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated the local deployment of LLMs. However, local deployment also presents challenges, particularly in balancing quality (generative performance), latency, and throughput within the hardware constraints of mobile devices. In this paper, we introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate LLMs on mobile devices. We provide a comprehensive benchmark of various popular LLMs with different quantization configurations (of both weights and activations) across multiple mobile platforms with varying hardware capabilities. Unlike traditional benchmarks that assess full-scale models on high-end GPU clusters, we focus on evaluating resource efficiency (memory and power consumption) and harmful output for compressed models on mobile devices. Our key observations include: i) differences in energy efficiency and throughput across mobile platforms; ii) the impact of quantization on memory usage, GPU execution time, and power consumption; iii) accuracy and performance degradation of quantized models compared to their non-quantized counterparts; and iv) the frequency of hallucinations and toxic content generated by compressed LLMs on mobile devices.
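As a rough illustration of the latency/throughput/memory side of such a benchmark (power draw and harmful-output detection require device-specific tooling and are omitted here), the following is a minimal host-side measurement wrapper. The `generate` callable and the dummy generator are hypothetical stand-ins for an on-device inference runtime, not part of the paper's released code.

```python
import time
import resource  # Unix-only; used here for peak resident set size

def measure_generation(generate, prompt, max_tokens=128):
    """Wrap a text-generation call and report latency, throughput, and peak memory.

    `generate` is a placeholder callable (prompt, max_tokens) -> list of tokens;
    it stands in for whatever inference runtime is being benchmarked.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start

    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "latency_s": elapsed,
        "throughput_tok_per_s": len(tokens) / elapsed if elapsed > 0 else 0.0,
        "peak_rss_mb": peak_rss / 1024,  # ru_maxrss is KB on Linux, bytes on macOS
    }

if __name__ == "__main__":
    def dummy_generate(prompt, max_tokens):
        # Stand-in that "generates" max_tokens placeholder tokens instantly.
        return ["tok"] * max_tokens

    print(measure_generation(dummy_generate, "Hello", max_tokens=64))
```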
Problem

Research questions and friction points this paper is trying to address.

Mobile Devices
Large Language Models
Efficiency and Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

PalmBench
Mobile Device Performance Evaluation
Compressed Large Language Models
👥 Authors
Yilong Li, PhD, Stanford University (operating systems, distributed systems, datacenter computing, networking)
Jingyu Liu, University of Wisconsin – Madison
Hao Zhang, University of Wisconsin – Madison
M. B. Narayanan, University of Wisconsin – Madison
Utkarsh Sharma, University of Wisconsin – Madison
Shuai Zhang, Amazon Web Services AI, USA
Pan Hu, Uber, USA
Yijing Zeng, University of Wisconsin – Madison
Jayaram Raghuram, University of Wisconsin – Madison
Suman Banerjee, Department of CSE, IIT Jammu (Algorithmic Data Management, Social Network Analysis, Graph Theory and Graph Algorithms, Parameterized Complexity)