🤖 AI Summary
Deploying Mixture-of-Experts Vision Transformers (MoE-ViTs) on resource-constrained FPGAs remains challenging due to their high computational and memory demands.
Method: This work proposes a two-stage co-quantization framework coupled with a resource-aware streaming acceleration architecture. It integrates scale reparameterized quantization, low-latency streaming attention kernels, reusable linear operators, and FPGA-customized pipelined scheduling to jointly optimize accuracy and efficiency.
Contribution/Results: Compared to state-of-the-art FPGA-based MoE accelerators, our design achieves 155 FPS throughput (5.35× improvement), reduces energy consumption by over 80%, and incurs less than 1% Top-1 accuracy degradation. To the best of our knowledge, this is the first hardware implementation of MoE-ViTs on FPGAs that simultaneously delivers high throughput, ultra-low power, and near-lossless accuracy—establishing a scalable solution for large-model vision inference at the edge.
📝 Abstract
Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35× improvement in throughput, and over 80% energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining <1% accuracy loss across vision benchmarks. Our implementation is available at https://github.com/DJ000011/CoQMoE.
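The abstract's scale reparameterization idea, in general terms, is to trade a precision-preserving per-channel quantizer for a hardware-friendly shared-scale quantizer without changing the network's output: the per-channel scale ratios are folded into the weights of the following linear layer. The sketch below is a minimal illustration of that folding identity, not the paper's actual implementation; the function name and the use of the median as the shared scale are assumptions for illustration.

```python
import numpy as np

def reparameterize_scales(per_channel_scales, next_weight):
    """Hypothetical sketch: replace per-channel dequantization scales with one
    shared scale, folding the per-channel ratio into the next linear layer.

    per_channel_scales: shape (in_features,), dequant scale s_c per channel
    next_weight:        shape (out_features, in_features)

    A value quantized to integer q_c dequantizes as s_c * q_c. Rewriting
    s_c * q_c = s_shared * (s_c / s_shared) * q_c shows the ratio r_c can be
    absorbed into column c of the next weight matrix, so W @ x is unchanged.
    """
    shared = np.median(per_channel_scales)       # choice of shared scale is an assumption
    ratio = per_channel_scales / shared
    new_weight = next_weight * ratio[np.newaxis, :]  # scale each input column
    return shared, new_weight
```

After this transform the activation path needs only a single per-tensor scale (cheap on an FPGA datapath), while the compensation lives entirely in pre-computed weights.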