COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference

📅 2025-04-22
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the computational, memory, and energy-efficiency bottlenecks of deploying Transformers on edge devices, this work proposes an algorithm-architecture co-optimized 1-bit binary accelerator. Unlike conventional binary (±1) or multi-bit ternary (−1/0/+1) multipliers, it introduces the first hardware implementation of a *true* 1-bit multiplier that nonetheless supports ternary values (−1/0/+1). It further designs a hardware-friendly binary attention module and builds an end-to-end binary Transformer inference engine. Implemented on an edge FPGA, the accelerator achieves 3,894.7 GOPS throughput and 448.7 GOPS/W energy efficiency—311× more energy-efficient than a GPU and 3.5× higher throughput than the state-of-the-art binary accelerator—while incurring negligible accuracy degradation.
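COBRA's exact multiplier design and encoding are not reproduced in this summary. As an illustration of the general idea behind performing ternary-valued (−1/0/+1) matrix operations with purely 1-bit arithmetic, a ternary weight vector can be split into a sign bitplane and a nonzero mask, so that each product reduces to an XNOR plus a mask, and accumulation to a popcount. A minimal NumPy sketch under these assumptions (all function names hypothetical, and binary ±1 activations used for simplicity):

```python
import numpy as np

def encode_ternary(w):
    """Encode a ternary vector w in {-1, 0, +1} as two bitplanes:
    sign (1 = positive) and mask (1 = nonzero)."""
    sign = (w > 0).astype(np.uint8)
    mask = (w != 0).astype(np.uint8)
    return sign, mask

def ternary_dot(x_bits, sign, mask):
    """Dot product of a {-1, +1} activation vector (as bits, 1 = +1)
    with a ternary weight vector, using only bitwise operations.
    XNOR marks sign agreement; the mask zeroes out zero weights."""
    agree = 1 - (x_bits ^ sign)         # XNOR: 1 where signs match
    pos = int(np.sum(agree & mask))     # matching nonzero positions
    n = int(np.sum(mask))               # total nonzero weights
    return 2 * pos - n                  # (+1)*pos + (-1)*(n - pos)

# Tiny example: the bitwise result matches the full-precision dot product.
x = np.array([1, -1, 1, 1])             # binary activations
w = np.array([1, 0, -1, 1])             # ternary weights
x_bits = (x > 0).astype(np.uint8)
sign, mask = encode_ternary(w)
assert ternary_dot(x_bits, sign, mask) == int(np.dot(x, w))
```

In hardware, the XNOR/AND/popcount operations above map onto single-bit logic rather than multi-bit multipliers, which is the kind of saving a "true 1-bit" datapath targets; the paper's actual scheme may differ in how it represents the zero value.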

📝 Abstract
Transformer-based models have demonstrated superior performance in various fields, including natural language processing and computer vision. However, their enormous model size and high demands on computation, memory, and communication limit their deployment to edge platforms for local, secure inference. Binary transformers offer a compact, low-complexity solution for edge deployment with reduced bandwidth needs and acceptable accuracy. However, existing binary transformers perform inefficiently on current hardware due to the lack of binary-specific optimizations. To address this, we introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs, delivering a 311x energy efficiency improvement over GPUs and a 3.5x throughput improvement over the state-of-the-art binary accelerator, with only negligible inference accuracy degradation.
Problem

Research questions and friction points this paper is trying to address.

Optimizing binary transformers for efficient edge inference
Reducing computation and memory demands for edge deployment
Improving hardware efficiency of binary transformers with co-optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithm-architecture co-optimized binary Transformer accelerator
Real 1-bit binary multiplication unit for efficient operations
Hardware-friendly optimizations in attention block for high performance
Ye Qiao
Ph.D. Candidate, University of California, Irvine
Machine Learning · Computer Architecture · Computer Vision · Edge Computing · In-memory Computing
Zhiheng Cheng
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Yian Wang
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Yifan Zhang
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Yunzhe Deng
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Sitao Huang
Assistant Professor of EECS, University of California, Irvine
Hardware Acceleration · High-Level Synthesis · FPGA · Parallel Computing · GPU