🤖 AI Summary
This study addresses the joint optimization of model capability, development efficiency, and system resource constraints for local large language model (LLM) deployment on resource-constrained edge devices (e.g., AI PCs). We systematically evaluate 12 open-weight LLMs (0.5B–14B parameters) under seven post-training quantization (PTQ) schemes on CPU hardware. Our empirical analysis reveals a near-linear relationship between effective bits per weight (BPW) and inference latency, memory footprint, and power consumption—identifying ~3.5 BPW as a critical performance inflection point where low-bit large models outperform high-precision smaller ones. Quantization to low BPW incurs negligible accuracy degradation while reducing memory usage by up to 72% and shifting power consumption dominance toward compute-intensive operations. The work delivers a reproducible, empirically grounded configuration guide for edge LLM deployment, establishing an engineering paradigm that balances privacy preservation with computational efficiency.
📝 Abstract
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, covering models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW: larger models quantized to low bit-widths consistently outperform smaller models at higher bit-precision. 3) Quantization to low BPW incurs marginal accuracy loss while yielding significant memory savings. 4) Power consumption on CPUs is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
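As a back-of-the-envelope illustration of the BPW–memory relationship described above, the sketch below estimates weight memory from parameter count and effective BPW. The specific model size (7B) and the 4.5 effective BPW figure (a 4-bit scheme plus per-group quantization metadata) are illustrative assumptions, not measurements from the paper:

```python
def effective_bpw(model_bytes: int, n_params: int) -> float:
    """Effective bits per weight: total on-disk model size in bits
    divided by the parameter count (includes scales/zero-points)."""
    return model_bytes * 8 / n_params

def est_memory_gb(n_params: float, bpw: float) -> float:
    """Rough weight-memory estimate in GB: params * BPW / 8 bits-per-byte."""
    return n_params * bpw / 8 / 1e9

# Hypothetical example: a 7B-parameter model, FP16 baseline vs. a 4-bit
# quantization that lands at ~4.5 effective BPW once metadata is counted.
fp16 = est_memory_gb(7e9, 16.0)  # 14.0 GB
q4 = est_memory_gb(7e9, 4.5)     # ~3.94 GB
print(f"FP16: {fp16:.1f} GB, Q4: {q4:.2f} GB, saving {(1 - q4 / fp16):.0%}")
```

Under these assumptions the reduction comes out to roughly 72%, consistent with the memory savings reported in the summary.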