QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Nonlinear operations (e.g., GeLU, Softmax) in Transformers incur high inference latency and substantial hardware overhead, limiting FPGA acceleration efficiency. To address this, we propose the first quantization-driven circuit-sharing framework targeting common patterns across nonlinear operators: by identifying reusable approximations of diverse nonlinear functions under ultra-low-bitwidth (≤4-bit) quantization, we design a unified, configurable computing unit that enables cross-operator and cross-layer hardware resource reuse on FPGAs. Our approach jointly optimizes accuracy and efficiency, achieving a 1.96× end-to-end speedup on mainstream Transformer models, reducing nonlinear-module area by 52%, and surpassing FP16 baseline accuracy by up to 0.3% at 2–4-bit quantization.
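The paper's hardware unit is not reproduced here, but its core observation, that under ≤4-bit quantization an elementwise nonlinearity collapses to a handful of input codes so one small table can serve several operators, can be illustrated with a minimal Python sketch. The class name `SharedNonlinearLUT`, the `configure` method, and the uniform quantization grid are illustrative assumptions, not the authors' design.

```python
import math


class SharedNonlinearLUT:
    """Minimal software model of a shared nonlinear unit.

    Under b-bit quantization an input can take only 2**b distinct codes,
    so any elementwise nonlinearity reduces to a 2**b-entry lookup table.
    The same storage is reused for different operators (GeLU, exp for
    Softmax, ...) by rewriting the table contents.
    """

    def __init__(self, bits=4, lo=-8.0, hi=8.0):
        self.levels = 2 ** bits
        self.lo, self.hi = lo, hi
        self.step = (hi - lo) / (self.levels - 1)
        self.table = [0.0] * self.levels

    def quantize(self, x):
        """Clamp to [lo, hi] and map to a b-bit code on a uniform grid."""
        return int(round((min(max(x, self.lo), self.hi) - self.lo) / self.step))

    def configure(self, fn):
        """Reprogram the shared table for a new operator."""
        for code in range(self.levels):
            x = self.lo + code * self.step
            self.table[code] = fn(x)

    def __call__(self, x):
        return self.table[self.quantize(x)]


def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))


unit = SharedNonlinearLUT(bits=4)

unit.configure(gelu)        # the unit now acts as a GeLU approximator
print(unit(1.3))

unit.configure(math.exp)    # the same storage is reused for Softmax's exp
print(unit(-2.0))
```

At 4 bits the table holds only 16 entries, which is what makes time-sharing one physical unit across operators and layers plausible; the paper's actual unit is an FPGA circuit, not a Python table.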

📝 Abstract
Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in these models contribute significantly to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96× end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy, and even substantially boosting accuracy under ultra-low-bit quantization.
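As a concrete instance of the sharing idea described in the abstract, the sketch below computes a Softmax whose only nonlinear step is a 16-entry exp table over 4-bit input codes, the same kind of table that could also be programmed as GeLU (see the sketch after the AI summary above). The names `lut_softmax`, `EXP_TABLE`, and the quantization range are illustrative assumptions, not the paper's circuit.

```python
import math

BITS = 4
LO, HI = -8.0, 0.0                       # softmax inputs are shifted to be <= 0
LEVELS = 2 ** BITS
STEP = (HI - LO) / (LEVELS - 1)

# 16-entry exp table: the only nonlinear resource this Softmax needs.
EXP_TABLE = [math.exp(LO + code * STEP) for code in range(LEVELS)]


def quantize(x):
    """Clamp to [LO, HI] and map to a 4-bit code on a uniform grid."""
    return int(round((min(max(x, LO), HI) - LO) / STEP))


def lut_softmax(logits):
    """Softmax whose exp() is served entirely by the shared lookup table."""
    m = max(logits)                      # shift for numerical stability
    exps = [EXP_TABLE[quantize(x - m)] for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(lut_softmax([2.0, 1.0, -1.0]))     # approximates the exact softmax [0.705, 0.259, 0.035]
```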
Problem

Research questions and friction points this paper is trying to address.

Accelerating nonlinear operations in Transformer models
Reducing hardware resource requirements via circuit sharing
Maintaining model accuracy under ultra-low-bit quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-enabled FPGA acceleration framework
Leverages common patterns in nonlinear operations
Circuit-sharing design reduces hardware resource requirements
Zhixiong Zhao
Nanyang Technological University
Haomin Li
The Children's Hospital, Zhejiang University School of Medicine
Medical Informatics, Clinical Decision Support, Medical AI
Fangxin Liu
Shanghai Jiao Tong University
In-memory Computing, Brain-inspired Neuromorphic Computing
Yuncheng Lu
Nanyang Technological University
Zongwu Wang
School of Computer Science, Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Tao Yang
Huawei Technologies Co., Ltd
Li Jiang
School of Computer Science, Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Haibing Guan
School of Computer Science, Shanghai Jiao Tong University