TerEffic: Highly Efficient Ternary LLM Inference on FPGA

📅 2025-02-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high power consumption and low throughput caused by off-chip memory accesses when deploying large language models (LLMs) on edge devices, this paper proposes a fully on-chip ternary-quantization inference architecture tailored for FPGAs. Methodologically, it integrates ternary weight compression, custom ternary compute units, a deeply co-optimized on-chip memory hierarchy, and hardware-algorithm co-design. It introduces, for the first time, a scalable architecture supporting both pure on-chip execution and HBM-assisted modes, enabling full LLM weight residency entirely on FPGA fabric. Experimental results demonstrate 12,700 tokens/sec for a 370M model (149× faster than Jetson Orin Nano) and 521 tokens/sec for a 2.7B model under HBM assistance (2× faster than an A100), with peak energy-efficiency improvements up to 19×.
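The "ternary weight compression" in the summary rests on a simple observation: a weight in {-1, 0, +1} needs at most 2 bits, versus 16 bits for FP16, an 8× reduction before any further coding. A minimal NumPy sketch of such packing, assuming a straightforward 2-bit encoding (the paper's actual on-chip format is not specified here):

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, +1} into 2-bit codes, four per byte.

    Assumed encoding: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10.
    """
    codes = np.where(weights == 1, 1, np.where(weights == -1, 2, 0)).astype(np.uint8)
    codes = np.pad(codes, (0, (-len(codes)) % 4))  # pad to a multiple of 4
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary weights from the packed byte stream."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = ((packed[:, None] >> shifts) & 0b11).reshape(-1)[:n]
    lut = np.array([0, 1, -1, 0], dtype=np.int8)  # code -> weight
    return lut[codes]

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=1024).astype(np.int8)
packed = pack_ternary(w)
assert np.array_equal(unpack_ternary(packed, len(w)), w)
print(len(packed), "bytes packed vs", 2 * len(w), "bytes in FP16")  # 256 vs 2048
```

This 8× density gain is what makes full weight residency in on-chip BRAM/URAM plausible for the smaller models the paper targets.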

๐Ÿ“ Abstract
Large Language Model (LLM) deployment on edge devices is typically constrained by the need for off-chip memory access, leading to high power consumption and limited throughput. Ternary quantization for LLMs is promising in maintaining model accuracy while reducing memory footprint. However, existing accelerators have not exploited this potential for on-chip inference. We present TerEffic, an FPGA-based accelerator that carefully co-designs memory architecture and computational units to unlock highly efficient LLM inference with fully on-chip execution. Through weight compression, custom computational units, and memory hierarchy optimization, we achieve unprecedented efficiency by eliminating off-chip memory bandwidth bottlenecks. We propose two architectural variants: a fully on-chip design for smaller models and an HBM-assisted design for larger ones. When evaluated on a 370M parameter model with single-batch inference, our on-chip design achieves 12,700 tokens/sec (149 times higher than NVIDIA's Jetson Orin Nano) with a power efficiency of 467 tokens/sec/W (19 times better than Jetson Orin Nano). The HBM-assisted design provides 521 tokens/sec on a 2.7B parameter model (2 times higher than NVIDIA's A100) with 33W power consumption, achieving a power efficiency of 16 tokens/sec/W (8 times better than A100).
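The "custom computational units" mentioned in the abstract exploit a second property of ternary weights: multiplication by -1, 0, or +1 degenerates to a subtract, a skip, or an add, so the datapath needs no hardware multipliers. A hedged NumPy sketch of this multiplier-free matrix-vector product (the paper's actual FPGA datapath is not detailed on this page):

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute y = W @ x for ternary W in {-1, 0, +1} using only adds/subtracts.

    Each output element accumulates x over the +1 positions and
    subtracts x over the -1 positions; zeros are skipped entirely.
    """
    plus = np.where(W == 1, x, 0.0).sum(axis=1)   # contributions of +1 weights
    minus = np.where(W == -1, x, 0.0).sum(axis=1)  # contributions of -1 weights
    return plus - minus

rng = np.random.default_rng(1)
W = rng.integers(-1, 2, size=(8, 16)).astype(np.int8)
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(ternary_matvec(W, x), W.astype(np.float32) @ x)
```

On an FPGA this maps each weight to a small add/subtract/bypass mux rather than a DSP multiplier, which is one plausible source of the power-efficiency gains the abstract reports.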
Problem

Research questions and friction points this paper is trying to address.

Optimize LLM inference on edge devices
Reduce power consumption via on-chip execution
Enhance efficiency with ternary quantization and FPGA
Innovation

Methods, ideas, or system contributions that make the work stand out.

FPGA-based ternary quantization
On-chip memory architecture
Custom computational units optimization
Chenyang Yin
School of Electronic Engineering and Computer Science, Peking University
Zhenyu Bai
School of Computing, National University of Singapore
Pranav Venkatram
School of Computing, National University of Singapore
Shivam Aggarwal
Senior AI Compiler Engineer, Renesas
Efficient AI · Model Compression · Quantization
Zhaoying Li
School of Computing, National University of Singapore
Tulika Mitra
Professor of Computer Science, National University of Singapore
Design Automation · Low Power Design · Embedded Systems · Real-Time Systems