UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing resource utilization and performance when deploying Mixture-of-Experts Vision Transformers (MoE-ViT) on FPGAs, this paper proposes an end-to-end configurable hardware architecture. The authors design a streaming attention kernel and a reusable linear kernel that operate in concert, enabling dynamic expert routing and fine-grained parallelism. A two-stage heuristic search algorithm automatically optimizes architectural parameters under cross-platform resource constraints. Leveraging streaming computation, mixed-precision execution, and low-latency memory-access optimization, the implementation achieves 1.34× and 3.35× throughput improvements and 1.75× and 1.54× energy-efficiency gains on Xilinx ZCU102 and Alveo U280 FPGAs, respectively, significantly outperforming state-of-the-art FPGA-based MoE-ViT accelerators.
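The "dynamic expert routing" mentioned above follows the standard MoE pattern: a gating network scores every expert per token, and only the top-k experts are invoked. A minimal Python sketch of that routing step (all names are illustrative; the paper's kernels are hardware, not software):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1,
    so the token's output is a weighted blend of just k expert outputs.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]
```

Because only k of the experts run per token, the model's parameter count scales with the number of experts while per-token compute stays roughly constant, which is the property that motivates MoE-ViT in the first place.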

📝 Abstract
Compared to traditional Vision Transformers (ViT), Mixture-of-Experts Vision Transformers (MoE-ViT) are introduced to scale model size without a proportional increase in computational complexity, making them a new research focus. Given their high performance and reconfigurability, FPGA-based accelerators for MoE-ViT have emerged, delivering substantial gains over general-purpose processors. However, existing accelerators often fall short of fully exploring the design space, leading to suboptimal trade-offs between resource utilization and performance. To overcome this problem, we introduce UbiMoE, a novel end-to-end FPGA accelerator tailored for MoE-ViT. Leveraging the unique computational and memory-access patterns of MoE-ViTs, we develop a latency-optimized streaming attention kernel and a resource-efficient reusable linear kernel, effectively balancing performance and resource consumption. To further enhance design efficiency, we propose a two-stage heuristic search algorithm that optimally tunes hardware parameters for various FPGA resource constraints. Compared to state-of-the-art (SOTA) FPGA designs, UbiMoE achieves 1.34x and 3.35x throughput improvements for MoE-ViT on Xilinx ZCU102 and Alveo U280 platforms, respectively, while enhancing energy efficiency by 1.75x and 1.54x. Our implementation is available at https://github.com/DJ000011/UbiMoE.
Problem

Research questions and friction points this paper is trying to address.

Optimize FPGA accelerator for MoE-ViT
Balance performance and resource utilization
Enhance energy efficiency in vision transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

FPGA-based MoE-ViT accelerator design
Latency-optimized streaming attention kernel
Two-stage heuristic search algorithm
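The two-stage heuristic search listed above tunes hardware parameters (e.g., kernel parallelism factors) under a platform's resource budget. The paper does not give the algorithm in code form here, so the following is a hedged sketch of the general pattern, coarse grid search followed by local refinement, with placeholder resource and latency models that are not the paper's actual models:

```python
# Hypothetical two-stage search over two parallelism factors:
# pa (attention kernel) and pl (linear kernel). The cost models
# below are toy placeholders, not UbiMoE's real estimators.

def resources(pa, pl):
    """Toy DSP-usage model: each kernel's footprint grows with parallelism."""
    return 64 * pa + 32 * pl

def latency(pa, pl):
    """Toy latency model: more parallelism means lower latency."""
    return 1000.0 / pa + 2000.0 / pl

def two_stage_search(budget, coarse=(1, 2, 4, 8, 16)):
    # Stage 1: coarse grid search, keeping only resource-feasible points.
    feasible = [(pa, pl) for pa in coarse for pl in coarse
                if resources(pa, pl) <= budget]
    pa, pl = min(feasible, key=lambda c: latency(*c))
    # Stage 2: local refinement around the coarse winner.
    neighbors = [(pa + da, pl + dl)
                 for da in (-1, 0, 1) for dl in (-1, 0, 1)
                 if pa + da >= 1 and pl + dl >= 1
                 and resources(pa + da, pl + dl) <= budget]
    return min(neighbors, key=lambda c: latency(*c))
```

The coarse stage prunes the design space cheaply; the refinement stage then recovers configurations that a power-of-two grid would miss, which is what makes a two-stage scheme practical across platforms as different as ZCU102 and U280.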
Jiale Dong
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Wenqi Lou
University of Science and Technology of China
FPGA Accelerator; Algorithm-Hardware Co-Optimization
Zhendong Zheng
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Yunji Qin
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Lei Gong
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Chao Wang
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Xuehai Zhou
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China