🤖 AI Summary
Nonlinear operations (e.g., GeLU, Softmax) in Transformers incur high inference latency and substantial hardware overhead, limiting FPGA acceleration efficiency. To address this, we propose the first quantization-driven circuit-sharing framework that targets common patterns across nonlinear operators: by identifying the approximation reusability of diverse nonlinear functions under ultra-low-bitwidth (≤4-bit) quantization, we design a unified, configurable computing unit that enables cross-operator and cross-layer hardware resource reuse on FPGAs. Our approach jointly optimizes accuracy and efficiency: it achieves up to a 1.96× end-to-end speedup on mainstream Transformer models, reduces nonlinear-module area by 52%, and surpasses FP16 baseline accuracy by up to 0.3% at 2–4-bit quantization.
📝 Abstract
Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in these models contribute significantly to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96× end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy, and even substantially boosting accuracy under ultra-low-bit quantization.
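To make the circuit-sharing idea concrete, here is a minimal software sketch of the general pattern (not QUARK's actual hardware design): a single lookup-table datapath with 2^4 = 16 entries serves different nonlinear operators, with only the stored table changing between GeLU and the exponential used inside Softmax. The grid ranges, entry counts, and nearest-entry indexing below are illustrative assumptions.

```python
import numpy as np

def build_lut(fn, lo, hi, bits=4):
    """Tabulate fn at 2**bits uniformly spaced points in [lo, hi].
    The table contents are the only operator-specific state."""
    grid = np.linspace(lo, hi, 2 ** bits)
    return grid, fn(grid)

def shared_unit(x, grid, table):
    """One shared datapath for every operator: clamp, index, look up.
    In hardware, this is the unit reused across operators and layers."""
    lo, hi = grid[0], grid[-1]
    step = (hi - lo) / (len(grid) - 1)
    idx = np.clip(np.round((np.asarray(x) - lo) / step).astype(int),
                  0, len(grid) - 1)
    return table[idx]

# Two operators, one compute unit -- only the stored table differs.
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi)
                                          * (x + 0.044715 * x ** 3)))
g_grid, g_tab = build_lut(gelu, -4.0, 4.0)
e_grid, e_tab = build_lut(np.exp, -8.0, 0.0)  # softmax uses exp(x - max) <= 1

def lut_softmax(v):
    """Softmax whose exponential reuses the same shared unit."""
    shifted = v - v.max()
    num = shared_unit(shifted, e_grid, e_tab)
    return num / num.sum()
```

The sketch only captures the reuse pattern; the paper's contribution is making such sharing accurate at 2-4 bits and efficient as an FPGA circuit.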