CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying Mixture-of-Experts Vision Transformers (MoE-ViTs) on resource-constrained FPGAs remains challenging due to their high computational and memory demands. Method: This work proposes a two-stage co-quantization framework coupled with a resource-aware streaming acceleration architecture. It integrates scale reparameterized quantization, low-latency streaming attention kernels, reusable linear operators, and FPGA-customized pipelined scheduling to jointly optimize accuracy and efficiency. Contribution/Results: Compared to state-of-the-art FPGA-based MoE accelerators, our design achieves 155 FPS throughput (5.35× improvement), reduces energy consumption by over 80%, and incurs less than 1% Top-1 accuracy degradation. To the best of our knowledge, this is the first hardware implementation of MoE-ViTs on FPGAs that simultaneously delivers high throughput, ultra-low power, and near-lossless accuracy—establishing a scalable solution for large-model vision inference at the edge.
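The summary above mentions reusable linear operators shared across experts. As a rough software analogue (a minimal sketch assuming top-1 routing; names such as `shared_linear` and `moe_layer` are hypothetical and not taken from the CoQMoE code base), a single GEMM routine can be time-multiplexed over the experts instead of replicating one compute path per expert:

```python
import numpy as np

def shared_linear(x, weight, bias):
    """The one linear operator that every expert call reuses (hypothetical name)."""
    return x @ weight.T + bias

def moe_layer(tokens, gate_w, expert_weights, expert_biases):
    """Top-1 MoE layer: route each token, then invoke the shared operator per expert."""
    logits = tokens @ gate_w.T              # (n_tokens, n_experts) router scores
    choice = logits.argmax(axis=1)          # top-1 expert index per token
    out = np.zeros_like(tokens)
    for e in range(gate_w.shape[0]):        # experts processed sequentially,
        idx = np.where(choice == e)[0]      # reusing the same operator each pass
        if idx.size:
            out[idx] = shared_linear(tokens[idx], expert_weights[e], expert_biases[e])
    return out
```

On hardware the analogous benefit is resource sharing: one physical linear kernel with swapped weight tiles rather than per-expert units, which is consistent with the performance-versus-resource balance the summary and abstract describe.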

📝 Abstract
Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35× improvement in throughput, and over 80% energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining <1% accuracy loss across vision benchmarks. Our implementation is available at https://github.com/DJ000011/CoQMoE.
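To make the dual-stage idea concrete, here is a minimal NumPy sketch of scale reparameterization under simplifying assumptions: stage 1 calibrates with a per-channel quantizer, and stage 2 folds the per-channel/per-tensor scale ratio into the following linear layer so the deployed kernel only needs a single hardware-friendly scale. The function names and the exact folding step are illustrative assumptions, not code from the CoQMoE repository.

```python
import numpy as np

def per_channel_quantize(x, n_bits=8):
    """Stage 1: precision-preserving quantizer (symmetric, one scale per channel)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=0) / qmax                 # per-channel scales
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def reparameterize_scales(per_channel_scale, next_weight):
    """Stage 2: collapse per-channel scales into one per-tensor scale.

    The channel-wise correction is absorbed into the next layer's weights, so
    the accelerator only implements the simple per-tensor quantizer while the
    stage-1 quantization error is left unchanged.
    """
    shared_scale = per_channel_scale.mean()              # single hardware-friendly scale
    ratio = per_channel_scale / shared_scale             # per-channel correction factor
    folded_weight = next_weight * ratio[None, :]         # fold correction into next GEMM
    return folded_weight, shared_scale

# Usage: the reparameterized path reproduces the per-channel result exactly.
x = np.random.randn(4, 8)                                # (tokens, channels) activations
W = np.random.randn(16, 8)                               # next linear layer, (out, in)
q, s = per_channel_quantize(x)
W_folded, s_shared = reparameterize_scales(s, W)
assert np.allclose((q * s) @ W.T, (q * s_shared) @ W_folded.T)
```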
Problem

Research questions and friction points this paper is trying to address.

Efficient deployment of MoE-ViTs on resource-constrained FPGAs
Minimizing accuracy loss in quantized MoE models
Optimizing FPGA accelerator performance and energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stage quantization scheme for MoE models
Latency-optimized streaming attention kernels (see the sketch after this list)
Resource-aware accelerator architecture for FPGAs
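Below is a minimal sketch of what a latency-optimized streaming attention kernel computes, written as online-softmax attention in NumPy: key/value tiles are consumed as they stream in, so the full score matrix is never materialized. This is a generic illustration of the streaming idea (the tile size, single-query form, and function names are assumptions), not the paper's FPGA kernel.

```python
import numpy as np

def streaming_attention(q, K, V, tile=16):
    """Online-softmax attention over streamed K/V tiles for one query row.

    Only one tile of scores is live at a time; the running max and softmax
    denominator are rescaled as each tile arrives, which is the property a
    streaming hardware kernel exploits to overlap compute with data movement.
    """
    d = q.shape[0]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1])                    # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        k_t, v_t = K[start:start + tile], V[start:start + tile]
        s = (k_t @ q) / np.sqrt(d)                # scores for this tile only
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)                 # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_t
        m = m_new
    return acc / l

# Matches the reference that materializes all scores at once.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
scores = (K @ q) / np.sqrt(8)
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
assert np.allclose(streaming_attention(q, K, V), ref)
```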
Jiale Dong
University of Science and Technology of China, Hefei, China
Hao Wu
University of Science and Technology of China, Hefei, China
Zihao Wang
University of Science and Technology of China, Hefei, China
Wenqi Lou
University of Science and Technology of China, Hefei, China
Zhendong Zheng
University of Science and Technology of China, Hefei, China
Lei Gong
University of Science and Technology of China, Hefei, China
Chao Wang
University of Science and Technology of China, Hefei, China
Xuehai Zhou
University of Science and Technology of China, Hefei, China