🤖 AI Summary
This study addresses the joint optimization of model capability, development efficiency, and system resource constraints for local large language model (LLM) deployment on resource-constrained edge devices (e.g., AI PCs). We systematically evaluate 12 open-weight LLMs (0.5B–14B parameters) under seven post-training quantization (PTQ) schemes on CPU hardware. Our empirical analysis reveals a near-linear relationship between effective bits per weight (BPW) and inference latency, memory footprint, and power consumption—identifying ~3.5 BPW as a critical performance inflection point where low-bit large models outperform high-precision smaller ones. Quantization to low BPW incurs negligible accuracy degradation while reducing memory usage by up to 72% and shifting power consumption dominance toward compute-intensive operations. The work delivers a reproducible, empirically grounded configuration guide for edge LLM deployment, establishing an engineering paradigm that balances privacy preservation with computational efficiency.
📝 Abstract
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, covering models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW: larger models quantized to low bit-widths consistently outperform smaller models at higher bit-precision. 3) Quantization to low BPW incurs marginal accuracy loss while yielding significant memory savings. 4) Power consumption on CPUs is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
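As a back-of-the-envelope illustration of the BPW–memory relationship described above, the sketch below estimates weight memory from parameter count and effective BPW. The specific model size (7B) and the 4.5 effective BPW figure (a 4-bit scheme plus per-group quantization metadata) are illustrative assumptions, not measurements from the paper:

```python
def effective_bpw(model_bytes: int, n_params: int) -> float:
    """Effective bits per weight: total on-disk model size in bits
    divided by the parameter count (includes scales/zero-points)."""
    return model_bytes * 8 / n_params

def est_memory_gb(n_params: float, bpw: float) -> float:
    """Rough weight-memory estimate in GB: params * BPW / 8 bits-per-byte."""
    return n_params * bpw / 8 / 1e9

# Hypothetical example: a 7B-parameter model, FP16 baseline vs. a 4-bit
# quantization that lands at ~4.5 effective BPW once metadata is counted.
fp16 = est_memory_gb(7e9, 16.0)  # 14.0 GB
q4 = est_memory_gb(7e9, 4.5)     # ~3.94 GB
print(f"FP16: {fp16:.1f} GB, Q4: {q4:.2f} GB, saving {(1 - q4 / fp16):.0%}")
```

Under these assumptions the reduction comes out to roughly 72%, consistent with the memory savings reported in the summary.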