🤖 AI Summary
Nonlinear operations (e.g., GeLU, Softmax) in Transformers incur high inference latency and substantial hardware overhead, limiting FPGA acceleration efficiency. To address this, we propose the first quantization-driven circuit-sharing framework that targets common patterns across nonlinear operators: by identifying the approximation reusability of diverse nonlinear functions under ultra-low-bitwidth (≤4-bit) quantization, we design a unified, configurable computing unit that enables cross-operator and cross-layer hardware resource reuse on FPGAs. Our approach jointly optimizes accuracy and efficiency: it achieves up to a 1.96× end-to-end speedup on mainstream Transformer models, reduces nonlinear-module area by 52%, and surpasses FP16 baseline accuracy by up to 0.3% at 2–4-bit quantization.
📝 Abstract
Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in these models contribute significantly to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96× end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy, and even substantially boosting accuracy under ultra-low-bit quantization.
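To make the circuit-sharing idea concrete, here is a minimal software sketch of the general pattern (not QUARK's actual hardware design): a single lookup-table datapath with 2^4 = 16 entries serves different nonlinear operators, with only the stored table changing between GeLU and the exponential used inside Softmax. The grid ranges, entry counts, and nearest-entry indexing below are illustrative assumptions.

```python
import numpy as np

def build_lut(fn, lo, hi, bits=4):
    """Tabulate fn at 2**bits uniformly spaced points in [lo, hi].
    The table contents are the only operator-specific state."""
    grid = np.linspace(lo, hi, 2 ** bits)
    return grid, fn(grid)

def shared_unit(x, grid, table):
    """One shared datapath for every operator: clamp, index, look up.
    In hardware, this is the unit reused across operators and layers."""
    lo, hi = grid[0], grid[-1]
    step = (hi - lo) / (len(grid) - 1)
    idx = np.clip(np.round((np.asarray(x) - lo) / step).astype(int),
                  0, len(grid) - 1)
    return table[idx]

# Two operators, one compute unit -- only the stored table differs.
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi)
                                          * (x + 0.044715 * x ** 3)))
g_grid, g_tab = build_lut(gelu, -4.0, 4.0)
e_grid, e_tab = build_lut(np.exp, -8.0, 0.0)  # softmax uses exp(x - max) <= 1

def lut_softmax(v):
    """Softmax whose exponential reuses the same shared unit."""
    shifted = v - v.max()
    num = shared_unit(shifted, e_grid, e_tab)
    return num / num.sum()
```

The sketch only captures the reuse pattern; the paper's contribution is making such sharing accurate at 2-4 bits and efficient as an FPGA circuit.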