CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying Mixture-of-Experts Vision Transformers (MoE-ViTs) on resource-constrained FPGAs remains challenging due to their high computational and memory demands. Method: This work proposes a two-stage co-quantization framework coupled with a resource-aware streaming acceleration architecture. It integrates scale reparameterized quantization, low-latency streaming attention kernels, reusable linear operators, and FPGA-customized pipelined scheduling to jointly optimize accuracy and efficiency. Contribution/Results: Compared to state-of-the-art FPGA-based MoE accelerators, our design achieves 155 FPS throughput (5.35× improvement), reduces energy consumption by over 80%, and incurs less than 1% Top-1 accuracy degradation. To the best of our knowledge, this is the first hardware implementation of MoE-ViTs on FPGAs that simultaneously delivers high throughput, ultra-low power, and near-lossless accuracy—establishing a scalable solution for large-model vision inference at the edge.
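The summary above mentions reusable linear operators shared across experts. As a rough software analogue (a minimal sketch assuming top-1 routing; names such as `shared_linear` and `moe_layer` are hypothetical and not taken from the CoQMoE code base), a single GEMM routine can be time-multiplexed over the experts instead of replicating one compute path per expert:

```python
import numpy as np

def shared_linear(x, weight, bias):
    """The one linear operator that every expert call reuses (hypothetical name)."""
    return x @ weight.T + bias

def moe_layer(tokens, gate_w, expert_weights, expert_biases):
    """Top-1 MoE layer: route each token, then invoke the shared operator per expert."""
    logits = tokens @ gate_w.T              # (n_tokens, n_experts) router scores
    choice = logits.argmax(axis=1)          # top-1 expert index per token
    out = np.zeros_like(tokens)
    for e in range(gate_w.shape[0]):        # experts processed sequentially,
        idx = np.where(choice == e)[0]      # reusing the same operator each pass
        if idx.size:
            out[idx] = shared_linear(tokens[idx], expert_weights[e], expert_biases[e])
    return out
```

On hardware the analogous benefit is resource sharing: one physical linear kernel with swapped weight tiles rather than per-expert units, which is consistent with the performance-versus-resource balance the summary and abstract describe.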

📝 Abstract
Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35× improvement in throughput, and over 80% energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining <1% accuracy loss across vision benchmarks. Our implementation is available at https://github.com/DJ000011/CoQMoE.
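To make the dual-stage idea concrete, here is a minimal NumPy sketch of scale reparameterization under simplifying assumptions: stage 1 calibrates with a per-channel quantizer, and stage 2 folds the per-channel/per-tensor scale ratio into the following linear layer so the deployed kernel only needs a single hardware-friendly scale. The function names and the exact folding step are illustrative assumptions, not code from the CoQMoE repository.

```python
import numpy as np

def per_channel_quantize(x, n_bits=8):
    """Stage 1: precision-preserving quantizer (symmetric, one scale per channel)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=0) / qmax                 # per-channel scales
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def reparameterize_scales(per_channel_scale, next_weight):
    """Stage 2: collapse per-channel scales into one per-tensor scale.

    The channel-wise correction is absorbed into the next layer's weights, so
    the accelerator only implements the simple per-tensor quantizer while the
    stage-1 quantization error is left unchanged.
    """
    shared_scale = per_channel_scale.mean()              # single hardware-friendly scale
    ratio = per_channel_scale / shared_scale             # per-channel correction factor
    folded_weight = next_weight * ratio[None, :]         # fold correction into next GEMM
    return folded_weight, shared_scale

# Usage: the reparameterized path reproduces the per-channel result exactly.
x = np.random.randn(4, 8)                                # (tokens, channels) activations
W = np.random.randn(16, 8)                               # next linear layer, (out, in)
q, s = per_channel_quantize(x)
W_folded, s_shared = reparameterize_scales(s, W)
assert np.allclose((q * s) @ W.T, (q * s_shared) @ W_folded.T)
```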
Problem

Research questions and friction points this paper is trying to address.

Efficient deployment of MoE-ViTs on resource-constrained FPGAs
Minimizing accuracy loss in quantized MoE models
Optimizing FPGA accelerator performance and energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stage quantization scheme for MoE models
Latency-optimized streaming attention kernels (see the sketch after this list)
Resource-aware accelerator architecture for FPGAs
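Below is a minimal sketch of what a latency-optimized streaming attention kernel computes, written as online-softmax attention in NumPy: key/value tiles are consumed as they stream in, so the full score matrix is never materialized. This is a generic illustration of the streaming idea (the tile size, single-query form, and function names are assumptions), not the paper's FPGA kernel.

```python
import numpy as np

def streaming_attention(q, K, V, tile=16):
    """Online-softmax attention over streamed K/V tiles for one query row.

    Only one tile of scores is live at a time; the running max and softmax
    denominator are rescaled as each tile arrives, which is the property a
    streaming hardware kernel exploits to overlap compute with data movement.
    """
    d = q.shape[0]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1])                    # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        k_t, v_t = K[start:start + tile], V[start:start + tile]
        s = (k_t @ q) / np.sqrt(d)                # scores for this tile only
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)                 # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_t
        m = m_new
    return acc / l

# Matches the reference that materializes all scores at once.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
scores = (K @ q) / np.sqrt(8)
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
assert np.allclose(streaming_attention(q, K, V), ref)
```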
Jiale Dong
University of Science and Technology of China, Hefei, China
Hao Wu
University of Science and Technology of China, Hefei, China
Zihao Wang
University of Science and Technology of China, Hefei, China
Wenqi Lou
University of Science and Technology of China, Hefei, China
Zhendong Zheng
University of Science and Technology of China, Hefei, China
Lei Gong
University of Science and Technology of China, Hefei, China
Chao Wang
University of Science and Technology of China, Hefei, China
Xuehai Zhou
University of Science and Technology of China, Hefei, China