🤖 AI Summary
Deploying large language models (LLMs) efficiently on resource-constrained edge devices remains challenging due to trade-offs among energy efficiency, inference latency, and task accuracy.
Method: This work systematically evaluates 28 quantized LLMs from the Ollama library on a Raspberry Pi 4 across five benchmark suites (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), using hardware-level power measurement via an INA219 sensor for fine-grained energy profiling.
Contribution/Results: The study presents the first integrated analysis combining precise energy measurement with multidimensional LLM evaluation, uncovering a three-way trade-off among quantization level, task type, and hardware constraints, and proposes a sustainability-aware optimization framework for edge LLM configuration. Experiments identify Pareto-optimal configurations that achieve a 42% energy reduction, sub-3.2-second per-query latency, and 68.5% TruthfulQA accuracy, yielding reproducible, practical deployment guidelines for edge AI.
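The Pareto-optimal configurations mentioned above can be identified by filtering out dominated settings across the three objectives (energy, latency, accuracy). A minimal sketch, not the authors' actual tooling, with purely illustrative configuration names and numbers:

```python
def pareto_front(configs):
    """Return configurations not dominated on (energy, latency, accuracy).

    Config b dominates config a if b is no worse on all three objectives
    (lower energy, lower latency, higher accuracy) and strictly better
    on at least one of them.
    """
    front = []
    for a in configs:
        dominated = any(
            b["energy_j"] <= a["energy_j"]
            and b["latency_s"] <= a["latency_s"]
            and b["accuracy"] >= a["accuracy"]
            and (b["energy_j"], b["latency_s"], b["accuracy"])
            != (a["energy_j"], a["latency_s"], a["accuracy"])
            for b in configs
        )
        if not dominated:
            front.append(a)
    return front


# Hypothetical quantization settings; values do not come from the paper.
configs = [
    {"name": "q4_0", "energy_j": 18.2, "latency_s": 3.1, "accuracy": 0.68},
    {"name": "q8_0", "energy_j": 31.4, "latency_s": 5.9, "accuracy": 0.70},
    {"name": "q2_K", "energy_j": 12.7, "latency_s": 2.4, "accuracy": 0.51},
    {"name": "q4_K", "energy_j": 19.0, "latency_s": 3.4, "accuracy": 0.67},
]
front = pareto_front(configs)  # q4_K is dominated by q4_0 and drops out
```

In this toy data, `q4_K` is strictly worse than `q4_0` on all three axes, so only the other three settings remain on the front.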
📝 Abstract
Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, slow inference, and energy consumption. Model quantization has emerged as a key technique for efficient LLM inference, reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which by default applies Post-Training Quantization (PTQ) with weight-only quantization, deployed on an edge device (a Raspberry Pi 4 with 4 GB of RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs among energy efficiency, inference speed, and accuracy under different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI and bridges a critical gap in existing research on energy-aware LLM deployment.
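The hardware-based energy measurement described above amounts to sampling instantaneous power (e.g., from an INA219 current/voltage sensor over I2C) and integrating it over each query. A minimal sketch of that integration step, assuming a list of timestamped power samples; the trace values are hypothetical and the code is not the authors' measurement stack:

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples into energy (J) via the
    trapezoidal rule, which handles unevenly spaced readings."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Hypothetical power trace for one query: 4 samples over 3 seconds,
# drawn at roughly 5-6 W (plausible for a loaded Raspberry Pi 4).
trace = [(0.0, 5.0), (1.0, 6.0), (2.0, 6.0), (3.0, 5.0)]
total = energy_joules(trace)  # 17.0 J for this toy trace
```

Dividing such a per-query energy figure by query count or token count yields the energy-efficiency metrics (J/query, J/token) that the trade-off analysis compares across quantization levels.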