🤖 AI Summary
To address the non-differentiability of conventional MoE routing, low block-level sparsity, and poor compatibility with edge-device acceleration and speculative decoding, this paper proposes BlockFFN, a differentiable mixture-of-experts architecture tailored for on-device deployment. The method introduces three key innovations: (1) a router combining ReLU activation with RMSNorm for flexible, end-to-end differentiable expert selection; (2) chunk-level-sparsity-aware training objectives that significantly boost 8-token block-level sparsity alongside token-level sparsity; and (3) efficient acceleration kernels that, for the first time, combine activation sparsity with speculative decoding. BlockFFN achieves over 80% token-level and 70% 8-token block-level sparsity, outperforms existing MoE baselines, and delivers up to 3.67× inference speedup over dense models on real edge devices.
📝 Abstract
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup on real end-side devices over dense models. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
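The ReLU-plus-RMSNorm router described above can be sketched as follows. This is a minimal illustration under assumed shapes and names (`blockffn_router`, `W_r`, and the learnable-gain-free normalization are illustrative, not the authors' exact implementation): ReLU zeroes out negative routing scores, which yields differentiable token-level sparsity, and an RMSNorm-style rescaling stabilizes the magnitude of the surviving expert weights.

```python
import numpy as np

def blockffn_router(x, W_r, eps=1e-6):
    """Hedged sketch of a ReLU + RMSNorm router for one token.

    x:   (d_model,) hidden state
    W_r: (d_model, n_experts) routing projection (hypothetical name)
    Returns non-negative expert weights; zeros mean the expert is
    skipped entirely, which is the source of activation sparsity.
    """
    scores = np.maximum(x @ W_r, 0.0)        # ReLU: differentiable, naturally sparse
    rms = np.sqrt(np.mean(scores ** 2) + eps)  # RMSNorm-style rescaling (no learnable gain here)
    return scores / rms

# Toy usage with random weights
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W_r = rng.standard_normal((16, 16))
weights = blockffn_router(x, W_r)
token_sparsity = float(np.mean(weights == 0.0))  # fraction of experts this token skips
```

Because the zeros come from ReLU rather than a hard top-k selection, gradients flow through the router end-to-end, and the number of active experts can vary per token. Chunk-level sparsity, by contrast, depends on how much the active-expert sets of consecutive tokens overlap, which is what the paper's CLS-aware training objectives target.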