PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms

📅 2024-10-05
🏛️ arXiv.org
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the need for lightweight and efficient inference of large language models (LLMs) on mobile devices, this paper introduces the first automated benchmarking framework that jointly evaluates resource efficiency and safety risks. Methodologically, it integrates weight and activation quantization configurations, supports cross-platform performance profiling across ARM CPU/GPU/NPU backends, and incorporates real-time power-consumption monitoring alongside automated hallucination and toxicity detection. Contributions include: (1) a mobile-specific, multi-dimensional evaluation paradigm that unifies assessment of generation quality, latency, throughput, memory footprint, power draw, and harmful output; (2) empirical insights into the nonlinear trade-offs between mobile chip energy efficiency and quantization strategy, and their impact on latency and memory; and (3) quantitative evidence of systematic accuracy and safety degradation under model compression, establishing a reproducible, scalable evaluation standard for on-device LLM deployment.
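To make the evaluation paradigm concrete, the sketch below enumerates the kind of configuration sweep such a framework would need to cover (models × quantization schemes × backends, each scored on the listed metric dimensions). The model names, quantization labels such as "W4A16", and backend identifiers are illustrative assumptions for this sketch, not PalmBench's actual configuration API.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative only: names below are assumptions, not PalmBench's real config format.

@dataclass
class BenchConfig:
    model: str       # model identifier
    quant: str       # weight/activation quantization scheme, e.g. "W4A16"
    backend: str     # target execution backend on the device
    metrics: tuple   # dimensions the framework is described as measuring

MODELS = ("llama-2-7b", "gemma-2b")
QUANTS = ("W8A8", "W4A16", "W3A16")
BACKENDS = ("arm-cpu", "arm-gpu", "npu")
METRICS = ("quality", "latency", "throughput", "memory", "power", "harmful_output")

def build_sweep():
    """Enumerate the cross-product of models, quantization schemes, and backends."""
    return [BenchConfig(m, q, b, METRICS) for m, q, b in product(MODELS, QUANTS, BACKENDS)]

if __name__ == "__main__":
    for cfg in build_sweep():
        print(f"{cfg.model:>10} | {cfg.quant:<5} | {cfg.backend:<7} -> {', '.join(cfg.metrics)}")
```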

📝 Abstract
Deploying large language models (LLMs) locally on mobile devices is advantageous in scenarios where transmitting data to remote cloud servers is either undesirable due to privacy concerns or impractical due to limited network connectivity. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated the local deployment of LLMs. However, local deployment also presents challenges, particularly in balancing quality (generative performance), latency, and throughput within the hardware constraints of mobile devices. In this paper, we introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate LLMs on mobile devices. We provide a comprehensive benchmark of various popular LLMs with different quantization configurations (of both weights and activations) across multiple mobile platforms with varying hardware capabilities. Unlike traditional benchmarks that assess full-scale models on high-end GPU clusters, we focus on evaluating resource efficiency (memory and power consumption) and harmful output for compressed models on mobile devices. Our key observations include: i) differences in energy efficiency and throughput across mobile platforms; ii) the impact of quantization on memory usage, GPU execution time, and power consumption; iii) accuracy and performance degradation of quantized models compared to their non-quantized counterparts; and iv) the frequency of hallucinations and toxic content generated by compressed LLMs on mobile devices.
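As a rough illustration of the latency/throughput/memory side of such a benchmark (power draw and harmful-output detection require device-specific tooling and are omitted here), the following is a minimal host-side measurement wrapper. The `generate` callable and the dummy generator are hypothetical stand-ins for an on-device inference runtime, not part of the paper's released code.

```python
import time
import resource  # Unix-only; used here for peak resident set size

def measure_generation(generate, prompt, max_tokens=128):
    """Wrap a text-generation call and report latency, throughput, and peak memory.

    `generate` is a placeholder callable (prompt, max_tokens) -> list of tokens;
    it stands in for whatever inference runtime is being benchmarked.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start

    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "latency_s": elapsed,
        "throughput_tok_per_s": len(tokens) / elapsed if elapsed > 0 else 0.0,
        "peak_rss_mb": peak_rss / 1024,  # ru_maxrss is KB on Linux, bytes on macOS
    }

if __name__ == "__main__":
    def dummy_generate(prompt, max_tokens):
        # Stand-in that "generates" max_tokens placeholder tokens instantly.
        return ["tok"] * max_tokens

    print(measure_generation(dummy_generate, "Hello", max_tokens=64))
```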
Problem

Research questions and friction points this paper is trying to address.

Mobile Devices
Large Language Models
Efficiency and Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

PalmBench
Mobile Device Performance Evaluation
Compressed Large Language Models
👥 Authors
Yilong Li, PhD, Stanford University (operating systems, distributed systems, datacenter computing, networking)
Jingyu Liu, University of Wisconsin – Madison
Hao Zhang, University of Wisconsin – Madison
M. B. Narayanan, University of Wisconsin – Madison
Utkarsh Sharma, University of Wisconsin – Madison
Shuai Zhang, Amazon Web Services AI, USA
Pan Hu, Uber, USA
Yijing Zeng, University of Wisconsin – Madison
Jayaram Raghuram, University of Wisconsin – Madison
Suman Banerjee, Department of CSE, IIT Jammu (Algorithmic Data Management, Social Network Analysis, Graph Theory and Graph Algorithms, Parameterized Complexity)