Harnessing Large Language Models Locally: Empirical Results and Implications for AI PC

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the joint optimization of model capability, development efficiency, and system resource constraints for local large language model (LLM) deployment on resource-constrained edge devices (e.g., AI PCs). We systematically evaluate 12 open-weight LLMs (0.5B–14B parameters) under seven post-training quantization (PTQ) schemes on CPU hardware. Our empirical analysis reveals a near-linear relationship between effective bits per weight (BPW) and inference latency, memory footprint, and power consumption—identifying ~3.5 BPW as a critical performance inflection point where low-bit large models outperform high-precision smaller ones. Quantization to low BPW incurs negligible accuracy degradation while reducing memory usage by up to 72% and shifting power consumption dominance toward compute-intensive operations. The work delivers a reproducible, empirically grounded configuration guide for edge LLM deployment, establishing an engineering paradigm that balances privacy preservation with computational efficiency.
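The reported ~3.5 BPW inflection point and the up-to-72% memory reduction follow from simple weight-storage arithmetic. As a hedged illustration (not the paper's codebase; the model sizes and BPW values below are assumptions chosen for the example), a back-of-envelope estimate of weight memory is parameters × bits-per-weight / 8, ignoring KV cache, activations, and runtime overhead:

```python
def weight_memory_gib(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GiB for a model quantized to `bpw`
    effective bits per weight (weights only; KV cache and activations
    are not included in this sketch)."""
    return n_params * bpw / 8 / 2**30

# Illustrative comparison around the ~3.5 BPW threshold: a 14B model
# under low-bit quantization vs. a 7B model kept at FP16.
large_low_bit = weight_memory_gib(14e9, 3.5)   # ~5.7 GiB
small_fp16 = weight_memory_gib(7e9, 16.0)      # ~13.0 GiB

print(f"14B @ 3.5 BPW: {large_low_bit:.1f} GiB")
print(f" 7B @ 16 BPW:  {small_fp16:.1f} GiB")
```

Under this estimate, the low-bit 14B model fits in less than half the memory of the FP16 7B model, which is consistent with the summary's claim that low-bit large models can displace high-precision smaller ones on memory-constrained AI PCs.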

📝 Abstract
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, covering models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW, beyond which larger models subjected to low-bit quantization consistently outperform smaller models using higher bit-precision. 3) Quantization to low BPW incurs marginal accuracy loss but significant memory savings. 4) Power consumption on CPU is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
Problem

Research questions and friction points this paper is trying to address.

Evaluating performance limitations of on-device LLMs due to compression.
Identifying optimal quantization thresholds for model efficiency and accuracy.
Assessing system resource trade-offs in edge device LLM deployment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic methodology for evaluating on-device LLMs across capability, efficiency, and resource axes
Low-bit quantized large models outperform higher-precision smaller models beyond ~3.5 effective BPW
Low-BPW quantization delivers significant memory savings with marginal accuracy loss
Authors

Qingyu Song (The Chinese University of Hong Kong)
Peiyu Liao (Huawei)
Wenqian Zhao (The Chinese University of Hong Kong; Deep Learning, Design Automation)
Yiwen Wang (Huawei)
Shoubo Hu (Huawei)
Hui-Ling Zhen (Huawei, Hong Kong; LLM Inference, Agent, Numerical Optimization, Numerical Computation)
Ning Jiang (Huawei)
Mingxuan Yuan (Huawei)