LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained edge devices—such as the Raspberry Pi—is hindered by limited compute, high power consumption, and substantial inference latency. To address these challenges, this work proposes a co-designed quantization and architectural optimization framework for low-power edge AI. First, we introduce a flexible k-bit post-training quantization (PTQ) scheme supporting 2-, 4-, 6-, and 8-bit weight quantization. Second, we propose a novel ternary quantization strategy integrated with quantization-aware training (QAT), specifically tailored for BitNet, which significantly mitigates accuracy degradation at ultra-low bitwidths (≤2 bits). Third, hardware-aware optimizations enable end-to-end acceleration. Experiments on the Raspberry Pi demonstrate a 3.2× throughput improvement, substantial energy reduction, and generation quality approaching that of full-precision FP16 models. To our knowledge, this is the first work to achieve high-fidelity, low-latency, and energy-efficient real-time conversational AI on Raspberry Pi–class devices.

📝 Abstract
Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization techniques to enable high-throughput, energy-efficient execution of LLMs on low-power embedded systems. Our approach leverages k-quantization, a Post-Training Quantization (PTQ) method designed for different bit-widths, enabling efficient 2-bit, 4-bit, 6-bit, and 8-bit weight quantization. Additionally, we employ ternary quantization using Quantization-Aware Training (QAT) for BitNet models, allowing for more effective adaptation to lower-bit representations while preserving accuracy. Our findings highlight the potential of quantized LLMs for real-time conversational AI on edge devices, paving the way for low-power, high-efficiency AI deployment in mobile and embedded applications. This study demonstrates that aggressive quantization strategies can significantly reduce energy consumption while maintaining inference quality, making LLMs practical for resource-limited environments.
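The k-bit PTQ idea described above can be sketched as symmetric round-to-nearest weight quantization. This is a minimal illustration of the general principle only, not the paper's actual k-quant kernels; the helper names `quantize_weights` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_weights(w, bits):
    # Symmetric per-tensor k-bit PTQ sketch (hypothetical helper):
    # map the largest-magnitude weight onto the top integer level.
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 1 for 2-bit
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Recover approximate FP32 weights from integers and the scale.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
# Reconstruction error shrinks as the bit-width grows.
errors = {b: float(np.abs(dequantize(*quantize_weights(w, b)) - w).mean())
          for b in (2, 4, 6, 8)}
```

The monotone drop in `errors` from 2-bit to 8-bit mirrors the accuracy/footprint trade-off the paper navigates across its supported bit-widths.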
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLMs for high-throughput on Raspberry Pi
Improving computational efficiency and reducing power consumption
Enabling real-time conversational AI on edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses k-quantization for multi-bit weight optimization
Employs ternary quantization with QAT for BitNet
Reduces energy via aggressive quantization strategies
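The ternary (BitNet-style) quantization named above can be sketched as scaling by the mean absolute weight and rounding to {-1, 0, +1}. This is an assumption-laden illustration of the forward pass only; the helper name `ternarize` is hypothetical, and in actual QAT this forward quantizer is paired with a straight-through estimator in the backward pass.

```python
import numpy as np

def ternarize(w):
    # Ternary weight quantization sketch in the spirit of BitNet:
    # scale by the mean absolute weight, then round each weight
    # to the nearest value in {-1, 0, +1}.
    alpha = float(np.mean(np.abs(w))) + 1e-8
    q = np.clip(np.round(w / alpha), -1, 1).astype(np.int8)
    return q, alpha

rng = np.random.default_rng(1)
w = rng.standard_normal((32, 32)).astype(np.float32)
q, alpha = ternarize(w)       # q holds only -1, 0, +1
```

Storing only the ternary codes plus one scale per tensor is what pushes the effective weight footprint below 2 bits while keeping a cheap multiply-free inner loop.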
Mahsa Ardakani
Department of Computer Science and Engineering, University of South Carolina
Jinendra Malekar
Department of Computer Science and Engineering, University of South Carolina
Ramtin Zand
Assistant Professor, University of South Carolina
Edge Computing · Neuromorphic Computing · In-Memory Computing · Machine Learning · Processing-In-Memory