🤖 AI Summary
Deploying large language models (LLMs) on edge FPGAs (e.g., AMD KV260) faces three interrelated energy-efficiency bottlenecks: limited compute throughput, constrained on-chip memory bandwidth, and high prefill latency. To address these challenges, this paper introduces the first hardware acceleration architecture for ternary LLMs that jointly optimizes prefill and autoregressive decoding. The design integrates a lookup-table-based ternary matrix multiplication engine, a fused attention module with a reversed reordering scheme, and a unified normalization and quantization–dequantization unit. Employing 1.58-bit ternary weights and 8-bit activations, the architecture achieves up to 9 tokens/s throughput under a 7 W power budget for 1024-token contexts, while reducing prefill latency to 0.55–1.15 seconds for 64–128-token prompts. This represents a significant advancement in energy-efficient generative AI deployment at the edge.
📝 Abstract
Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization–dequantization unit optimized for ultra-low-bit inference. Under a 7 W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55–1.15 s for 64–128-token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.
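To make the table-lookup idea in contribution (1) concrete, the sketch below shows one common software analogue of such an engine: activations are split into small groups, a lookup table of all possible ternary dot products is precomputed online per activation group, and every weight row then reduces to table indexing instead of multiplications. The group size `G = 4` and the base-3 weight encoding are illustrative assumptions, not details taken from the paper's hardware design.

```python
import numpy as np
from itertools import product

G = 4  # assumed activation group size; 3**G = 81 LUT entries per group


def build_lut(act_group: np.ndarray) -> np.ndarray:
    """Online precomputation: dot products of this activation group
    with every possible ternary weight pattern of length G."""
    patterns = np.array(list(product([-1, 0, 1], repeat=G)), dtype=np.int32)  # (81, G)
    return patterns @ act_group.astype(np.int32)  # (81,)


def encode_ternary_group(w_group: np.ndarray) -> int:
    """Map a length-G ternary weight group {-1,0,+1} to its base-3 LUT index,
    matching the enumeration order used in build_lut."""
    idx = 0
    for w in w_group:
        idx = idx * 3 + (int(w) + 1)
    return idx


def ternary_matvec_lut(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Ternary matrix-vector product via table lookup.
    W: (n_out, n_in) with entries in {-1, 0, +1}; x: (n_in,) int8 activations.
    One LUT per activation group is shared across all output rows."""
    n_out, n_in = W.shape
    assert n_in % G == 0
    y = np.zeros(n_out, dtype=np.int32)
    for g0 in range(0, n_in, G):
        lut = build_lut(x[g0:g0 + G])  # built once, reused by every row
        for r in range(n_out):
            y[r] += lut[encode_ternary_group(W[r, g0:g0 + G])]
    return y
```

In hardware, the per-group LUT amortizes the adder tree across all rows that share the activation group, which is where the resource saving claimed in the abstract comes from; this reference version only demonstrates the numerics.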