UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing resource utilization and performance when deploying Mixture-of-Experts Vision Transformers (MoE-ViT) on FPGAs, this paper proposes an end-to-end configurable hardware architecture. The authors design a streaming attention kernel and a reusable linear kernel that operate in concert, enabling dynamic expert routing and fine-grained parallelism. A two-stage heuristic search algorithm automatically optimizes architectural parameters under cross-platform resource constraints. Leveraging streaming computation, mixed-precision execution, and low-latency memory-access optimization, the implementation achieves 1.34× and 3.35× throughput improvements and 1.75× and 1.54× energy-efficiency gains on Xilinx ZCU102 and Alveo U280 FPGAs, respectively, significantly outperforming state-of-the-art FPGA-based MoE-ViT accelerators.
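The "dynamic expert routing" mentioned above follows the standard MoE pattern: a gating network scores every expert per token, and only the top-k experts are invoked. A minimal Python sketch of that routing step (all names are illustrative; the paper's kernels are hardware, not software):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1,
    so the token's output is a weighted blend of just k expert outputs.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]
```

Because only k of the experts run per token, the model's parameter count scales with the number of experts while per-token compute stays roughly constant, which is the property that motivates MoE-ViT in the first place.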

📝 Abstract
Compared to traditional Vision Transformers (ViT), Mixture-of-Experts Vision Transformers (MoE-ViT) are introduced to scale model size without a proportional increase in computational complexity, making them a new research focus. Given their high performance and reconfigurability, FPGA-based accelerators for MoE-ViT have emerged, delivering substantial gains over general-purpose processors. However, existing accelerators often fall short of fully exploring the design space, leading to suboptimal trade-offs between resource utilization and performance. To overcome this problem, we introduce UbiMoE, a novel end-to-end FPGA accelerator tailored for MoE-ViT. Leveraging the unique computational and memory-access patterns of MoE-ViTs, we develop a latency-optimized streaming attention kernel and a resource-efficient reusable linear kernel, effectively balancing performance and resource consumption. To further enhance design efficiency, we propose a two-stage heuristic search algorithm that optimally tunes hardware parameters for various FPGA resource constraints. Compared to state-of-the-art (SOTA) FPGA designs, UbiMoE achieves 1.34x and 3.35x throughput improvements for MoE-ViT on Xilinx ZCU102 and Alveo U280 platforms, respectively, while enhancing energy efficiency by 1.75x and 1.54x. Our implementation is available at https://github.com/DJ000011/UbiMoE.
Problem

Research questions and friction points this paper is trying to address.

Optimize FPGA accelerator for MoE-ViT
Balance performance and resource utilization
Enhance energy efficiency in vision transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

FPGA-based MoE-ViT accelerator design
Latency-optimized streaming attention kernel
Two-stage heuristic search algorithm
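The two-stage heuristic search listed above tunes hardware parameters (e.g., kernel parallelism factors) under a platform's resource budget. The paper does not give the algorithm in code form here, so the following is a hedged sketch of the general pattern, coarse grid search followed by local refinement, with placeholder resource and latency models that are not the paper's actual models:

```python
# Hypothetical two-stage search over two parallelism factors:
# pa (attention kernel) and pl (linear kernel). The cost models
# below are toy placeholders, not UbiMoE's real estimators.

def resources(pa, pl):
    """Toy DSP-usage model: each kernel's footprint grows with parallelism."""
    return 64 * pa + 32 * pl

def latency(pa, pl):
    """Toy latency model: more parallelism means lower latency."""
    return 1000.0 / pa + 2000.0 / pl

def two_stage_search(budget, coarse=(1, 2, 4, 8, 16)):
    # Stage 1: coarse grid search, keeping only resource-feasible points.
    feasible = [(pa, pl) for pa in coarse for pl in coarse
                if resources(pa, pl) <= budget]
    pa, pl = min(feasible, key=lambda c: latency(*c))
    # Stage 2: local refinement around the coarse winner.
    neighbors = [(pa + da, pl + dl)
                 for da in (-1, 0, 1) for dl in (-1, 0, 1)
                 if pa + da >= 1 and pl + dl >= 1
                 and resources(pa + da, pl + dl) <= budget]
    return min(neighbors, key=lambda c: latency(*c))
```

The coarse stage prunes the design space cheaply; the refinement stage then recovers configurations that a power-of-two grid would miss, which is what makes a two-stage scheme practical across platforms as different as ZCU102 and U280.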
Jiale Dong
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Wenqi Lou
University of Science and Technology of China
FPGA Accelerator; Algorithm-Hardware Co-Optimization
Zhendong Zheng
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Yunji Qin
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Lei Gong
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Chao Wang
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
Xuehai Zhou
University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China