LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained edge devices—such as the Raspberry Pi—is hindered by limited compute, high power consumption, and substantial inference latency. To address these challenges, this work proposes a co-designed quantization and architectural optimization framework for low-power edge AI. First, we introduce a flexible k-bit post-training quantization (PTQ) scheme supporting 2-, 4-, 6-, and 8-bit weight quantization. Second, we propose a novel ternary quantization strategy integrated with quantization-aware training (QAT), specifically tailored for BitNet, which significantly mitigates accuracy degradation at ultra-low bitwidths (≤2 bits). Third, hardware-aware optimizations enable end-to-end acceleration. Experiments on the Raspberry Pi demonstrate a 3.2× throughput improvement, substantial energy reduction, and generation quality approaching that of full-precision FP16 models. To our knowledge, this is the first work to achieve high-fidelity, low-latency, and energy-efficient real-time conversational AI on Raspberry Pi–class devices.

📝 Abstract
Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization techniques to enable high-throughput, energy-efficient execution of LLMs on low-power embedded systems. Our approach leverages k-quantization, a Post-Training Quantization (PTQ) method designed for different bit-widths, enabling efficient 2-bit, 4-bit, 6-bit, and 8-bit weight quantization. Additionally, we employ ternary quantization using Quantization-Aware Training (QAT) for BitNet models, allowing for more effective adaptation to lower-bit representations while preserving accuracy. Our findings highlight the potential of quantized LLMs for real-time conversational AI on edge devices, paving the way for low-power, high-efficiency AI deployment in mobile and embedded applications. This study demonstrates that aggressive quantization strategies can significantly reduce energy consumption while maintaining inference quality, making LLMs practical for resource-limited environments.
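The k-bit PTQ idea described above can be sketched as symmetric round-to-nearest weight quantization. This is a minimal illustration of the general principle only, not the paper's actual k-quant kernels; the helper names `quantize_weights` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_weights(w, bits):
    # Symmetric per-tensor k-bit PTQ sketch (hypothetical helper):
    # map the largest-magnitude weight onto the top integer level.
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 1 for 2-bit
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Recover approximate FP32 weights from integers and the scale.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
# Reconstruction error shrinks as the bit-width grows.
errors = {b: float(np.abs(dequantize(*quantize_weights(w, b)) - w).mean())
          for b in (2, 4, 6, 8)}
```

The monotone drop in `errors` from 2-bit to 8-bit mirrors the accuracy/footprint trade-off the paper navigates across its supported bit-widths.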
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLMs for high-throughput on Raspberry Pi
Improving computational efficiency and reducing power consumption
Enabling real-time conversational AI on edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses k-quantization for multi-bit weight optimization
Employs ternary quantization with QAT for BitNet
Reduces energy via aggressive quantization strategies
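The ternary (BitNet-style) quantization named above can be sketched as scaling by the mean absolute weight and rounding to {-1, 0, +1}. This is an assumption-laden illustration of the forward pass only; the helper name `ternarize` is hypothetical, and in actual QAT this forward quantizer is paired with a straight-through estimator in the backward pass.

```python
import numpy as np

def ternarize(w):
    # Ternary weight quantization sketch in the spirit of BitNet:
    # scale by the mean absolute weight, then round each weight
    # to the nearest value in {-1, 0, +1}.
    alpha = float(np.mean(np.abs(w))) + 1e-8
    q = np.clip(np.round(w / alpha), -1, 1).astype(np.int8)
    return q, alpha

rng = np.random.default_rng(1)
w = rng.standard_normal((32, 32)).astype(np.float32)
q, alpha = ternarize(w)       # q holds only -1, 0, +1
```

Storing only the ternary codes plus one scale per tensor is what pushes the effective weight footprint below 2 bits while keeping a cheap multiply-free inner loop.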
Mahsa Ardakani
Department of Computer Science and Engineering, University of South Carolina
Jinendra Malekar
Department of Computer Science and Engineering, University of South Carolina
Ramtin Zand
Assistant Professor, University of South Carolina
Edge Computing · Neuromorphic Computing · In-Memory Computing · Machine Learning · Processing-In-Memory