BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the non-differentiable routing of conventional MoE, its low chunk-level sparsity, and its poor compatibility with edge-device acceleration and speculative decoding, this paper proposes BlockFFN, a differentiable mixture-of-experts architecture tailored for on-device deployment. The method introduces three key innovations: (1) a differentiable router combining ReLU activation with RMSNorm, enabling flexible, end-to-end optimized expert selection; (2) chunk-level-sparsity-aware training objectives that significantly boost 8-token chunk-level sparsity alongside token-level sparsity; and (3) efficient acceleration kernels that combine activation sparsity with speculative decoding for the first time. BlockFFN achieves over 80% token-level and 70% 8-token chunk-level sparsity, and on real edge devices delivers up to 3.67× inference speedup over dense baselines while outperforming existing MoE approaches in both efficiency and accuracy.

📝 Abstract
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All codes and checkpoints are available publicly (https://github.com/thunlp/BlockFFN).
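The abstract's distinction between token-level and chunk-level sparsity can be made concrete: CLS measures the fraction of experts left untouched by the *union* of activations across a window of consecutive tokens. Below is a minimal numpy sketch of such a metric; the function name, the boolean-mask representation, and the `chunk` parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def chunk_level_sparsity(act_mask: np.ndarray, chunk: int = 8) -> float:
    """Illustrative chunk-level sparsity metric.

    act_mask: boolean array of shape [T, E], True where token t activates
    expert e. Returns the mean fraction of experts NOT activated by any
    token within each window of `chunk` consecutive tokens.
    """
    T, E = act_mask.shape
    n = T // chunk  # number of complete chunks; trailing tokens are dropped
    # Union of activations over each chunk: [n, E]
    union = act_mask[: n * chunk].reshape(n, chunk, E).any(axis=1)
    return float(1.0 - union.mean())
```

Under this definition, high token-level sparsity does not imply high chunk-level sparsity: if each token activates a *different* small expert subset, the union over 8 tokens can still cover most experts, which is exactly the pattern the paper's CLS-aware objectives are designed to avoid.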
Problem

Research questions and friction points this paper is trying to address.

Address non-differentiable routing in vanilla MoE models
Improve chunk-level sparsity for end-side acceleration
Combine activation sparsity with speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Router integrating ReLU activation and RMSNorm enables differentiable, flexible routing
CLS-aware training objectives enhance chunk-level sparsity alongside token-level sparsity
Efficient kernels combine activation sparsity with speculative decoding for the first time
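The first innovation, a router built from ReLU and RMSNorm, can be sketched in a few lines: ReLU zeroes out unselected experts while remaining differentiable (no hard top-k), and an RMSNorm over the expert dimension rescales the surviving scores into routing weights. This numpy sketch assumes a single learned projection `W_r` and an `eps` stabilizer; both names and the exact normalization placement are assumptions, not details confirmed by this page.

```python
import numpy as np

def relu_rmsnorm_router(x: np.ndarray, W_r: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical sketch of a ReLU + RMSNorm router.

    x:   [batch, d_model] hidden states
    W_r: [d_model, n_experts] routing projection (illustrative name)
    Returns nonnegative expert weights of shape [batch, n_experts];
    experts with negative pre-activation scores get weight exactly 0.
    """
    scores = np.maximum(x @ W_r, 0.0)  # ReLU: sparse yet differentiable selection
    # RMSNorm over the expert dimension rescales the nonnegative scores
    rms = np.sqrt(np.mean(scores ** 2, axis=-1, keepdims=True) + eps)
    return scores / rms
```

Because the ReLU output is exactly zero for inactive experts, their expert FFNs can be skipped at inference time, while gradients still flow through the active ones end-to-end, which is the differentiability property the Problem section highlights.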