TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge FPGAs (e.g., AMD KV260) faces three interrelated energy-efficiency bottlenecks: limited compute throughput, constrained on-chip memory bandwidth, and high prefill latency. To address these challenges, this paper introduces the first hardware acceleration architecture for ternary LLMs that jointly optimizes prefill and autoregressive decoding. The design integrates a lookup-table-based ternary matrix multiplication engine, a fused attention module with a reversed reordering scheme, and a unified normalization-quantization-dequantization unit. Employing 1.58-bit ternary weights and 8-bit activations, the architecture achieves up to 9 tokens/s throughput under a 7 W power budget for 1,024-token contexts, while reducing prefill latency to 0.55-1.15 seconds for 64-128-token prompts. This represents a significant advance in energy-efficient generative AI deployment at the edge.

📝 Abstract
Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization--dequantization unit optimized for ultra-low-bit inference. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55--1.15 s for 64--128 token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.
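The abstract's first contribution, a table-lookup engine for ternary matmul, can be illustrated with a small sketch. Because weights are restricted to {-1, 0, +1}, the dot product of a group of G activations with any ternary weight pattern takes one of only 3^G values; precomputing those partial sums online turns the matmul into table lookups instead of multiplies. The group size G and the base-3 index encoding below are illustrative assumptions, not the paper's exact hardware design:

```python
import numpy as np
import itertools

G = 2  # activations per group (illustrative; the paper's grouping factor may differ)

def ternary_matmul_lut(W, x):
    """Ternary (W in {-1, 0, +1}) matrix-vector product via table lookup.

    For each group of G activations, precompute the dot product with every
    possible ternary weight pattern (3**G entries), then reduce each output
    row to LUT reads instead of multiply-accumulates.
    """
    n_out, n_in = W.shape
    assert n_in % G == 0
    # all 3**G ternary patterns, in base-3 order with the first element most significant
    patterns = np.array(list(itertools.product([-1, 0, 1], repeat=G)))
    y = np.zeros(n_out, dtype=np.int64)
    for g in range(n_in // G):
        xg = x[g * G:(g + 1) * G].astype(np.int64)
        lut = patterns @ xg  # online precompute: 3**G partial sums for this group
        # encode each row's ternary weight group as a base-3 LUT index
        idx = np.zeros(n_out, dtype=np.int64)
        for j in range(G):
            idx = idx * 3 + (W[:, g * G + j] + 1)
        y += lut[idx]
    return y

# sanity check against the naive product
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))
x = rng.integers(-128, 128, size=8)  # 8-bit activation range
assert np.array_equal(ternary_matmul_lut(W, x), W @ x)
```

The appeal on an FPGA is that the 3^G-entry table is shared across all output rows for a given activation group, so multipliers are replaced by small LUT reads, which matches the resource profile of edge devices like the KV260.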
Problem

Research questions and friction points this paper is trying to address.

Deploying LLMs on edge platforms despite their high computational and memory demands
Overcoming limited on-chip resources and tight power budgets in edge deployment
Reducing the often-neglected latency of the prefill phase ahead of autoregressive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Table-lookup ternary matrix engine with grouped activations and online precomputation
Fused, bandwidth-efficient attention module with reversed reordering to accelerate prefill
Tightly integrated normalization and quantization-dequantization unit for ultra-low-bit inference
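The fused attention contribution relies on the fact that online-softmax accumulation is order-independent: K/V blocks can be streamed in reverse and still yield the exact softmax(qK^T)V result. The sketch below only demonstrates that property for a single query; the block size, traversal order, and single-query framing are illustrative assumptions, not the paper's exact reordering scheme:

```python
import numpy as np

def attention_blockwise(q, K, V, block=4, reverse=True):
    """Single-query attention with an online softmax over K/V blocks.

    Maintains a running max (m), running normalizer (s), and running
    weighted value sum (acc), rescaling them whenever a new block raises
    the max. The final result is identical for forward or reverse order.
    """
    n, d = K.shape
    m = -np.inf           # running max of attention scores
    s = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running weighted sum of values
    starts = list(range(0, n, block))
    for b in (reversed(starts) if reverse else starts):
        scores = K[b:b + block] @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)          # rescale previous accumulators
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ V[b:b + block]
        m = m_new
    return acc / s

# reference: plain softmax attention
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
scores = K @ q / np.sqrt(8)
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ V
assert np.allclose(attention_blockwise(q, K, V, reverse=True), ref)
assert np.allclose(attention_blockwise(q, K, V, reverse=False), ref)
```

Streaming blocks with constant-size accumulators is what makes the attention bandwidth-efficient on a device with limited on-chip memory; why the paper's reversed order specifically helps prefill is detailed in the paper itself.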
Ye Qiao
Ph.D. Candidate, University of California, Irvine
Machine Learning · Computer Architecture · Computer Vision · Edge Computing · In-memory Computing
Zhiheng Cheng
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Yifan Zhang
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Yian Wang
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Sitao Huang
Assistant Professor of EECS, University of California Irvine
Hardware Acceleration · High-Level Synthesis · FPGA · Parallel Computing · GPU