🤖 AI Summary
Deploying large language models (LLMs) on edge FPGAs (e.g., AMD KV260) faces three interrelated energy-efficiency bottlenecks: limited compute throughput, constrained on-chip memory bandwidth, and high prefill latency. To address these challenges, this paper introduces the first hardware acceleration architecture for ternary LLMs that jointly optimizes prefill and autoregressive decoding. The design integrates a lookup-table-based ternary matrix multiplication engine, a fused attention module with a reversed reordering scheme, and a unified normalization and quantization–dequantization unit. Employing 1.58-bit ternary weights and 8-bit activations, the architecture achieves up to 9 tokens/s throughput under a 7 W power budget for 1024-token contexts, while reducing prefill latency to 0.55–1.15 seconds for 64–128-token prompts. This represents a significant advancement in energy-efficient generative AI deployment at the edge.
📝 Abstract
Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization–dequantization unit optimized for ultra-low-bit inference. Under a 7 W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55–1.15 s for 64–128-token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.
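To make the table-lookup idea in contribution (1) concrete, the sketch below shows one common software analogue of such an engine: activations are split into small groups, a lookup table of all possible ternary dot products is precomputed online per activation group, and every weight row then reduces to table indexing instead of multiplications. The group size `G = 4` and the base-3 weight encoding are illustrative assumptions, not details taken from the paper's hardware design.

```python
import numpy as np
from itertools import product

G = 4  # assumed activation group size; 3**G = 81 LUT entries per group


def build_lut(act_group: np.ndarray) -> np.ndarray:
    """Online precomputation: dot products of this activation group
    with every possible ternary weight pattern of length G."""
    patterns = np.array(list(product([-1, 0, 1], repeat=G)), dtype=np.int32)  # (81, G)
    return patterns @ act_group.astype(np.int32)  # (81,)


def encode_ternary_group(w_group: np.ndarray) -> int:
    """Map a length-G ternary weight group {-1,0,+1} to its base-3 LUT index,
    matching the enumeration order used in build_lut."""
    idx = 0
    for w in w_group:
        idx = idx * 3 + (int(w) + 1)
    return idx


def ternary_matvec_lut(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Ternary matrix-vector product via table lookup.
    W: (n_out, n_in) with entries in {-1, 0, +1}; x: (n_in,) int8 activations.
    One LUT per activation group is shared across all output rows."""
    n_out, n_in = W.shape
    assert n_in % G == 0
    y = np.zeros(n_out, dtype=np.int32)
    for g0 in range(0, n_in, G):
        lut = build_lut(x[g0:g0 + G])  # built once, reused by every row
        for r in range(n_out):
            y[r] += lut[encode_ternary_group(W[r, g0:g0 + G])]
    return y
```

In hardware, the per-group LUT amortizes the adder tree across all rows that share the activation group, which is where the resource saving claimed in the abstract comes from; this reference version only demonstrates the numerics.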