🤖 AI Summary
To address the non-differentiability of conventional MoE routing, low block-level sparsity, and poor compatibility with edge-device acceleration and speculative decoding, this paper proposes BlockFFN, a differentiable mixture-of-experts architecture tailored for on-device deployment. The method introduces three key innovations: (1) a router combining ReLU activation with RMSNorm for flexible, end-to-end differentiable expert selection; (2) chunk-level-sparsity-aware training objectives that significantly boost 8-token block-level sparsity alongside token-level sparsity; and (3) efficient acceleration kernels that, for the first time, combine activation sparsity with speculative decoding. BlockFFN achieves over 80% token-level and 70% 8-token block-level sparsity, outperforms existing MoE baselines, and delivers up to 3.67× inference speedup over dense models on real edge devices.
📝 Abstract
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup on real end-side devices over dense models. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
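The ReLU-plus-RMSNorm router described above can be sketched as follows. This is a minimal illustration under assumed shapes and names (`blockffn_router`, `W_r`, and the learnable-gain-free normalization are illustrative, not the authors' exact implementation): ReLU zeroes out negative routing scores, which yields differentiable token-level sparsity, and an RMSNorm-style rescaling stabilizes the magnitude of the surviving expert weights.

```python
import numpy as np

def blockffn_router(x, W_r, eps=1e-6):
    """Hedged sketch of a ReLU + RMSNorm router for one token.

    x:   (d_model,) hidden state
    W_r: (d_model, n_experts) routing projection (hypothetical name)
    Returns non-negative expert weights; zeros mean the expert is
    skipped entirely, which is the source of activation sparsity.
    """
    scores = np.maximum(x @ W_r, 0.0)        # ReLU: differentiable, naturally sparse
    rms = np.sqrt(np.mean(scores ** 2) + eps)  # RMSNorm-style rescaling (no learnable gain here)
    return scores / rms

# Toy usage with random weights
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W_r = rng.standard_normal((16, 16))
weights = blockffn_router(x, W_r)
token_sparsity = float(np.mean(weights == 0.0))  # fraction of experts this token skips
```

Because the zeros come from ReLU rather than a hard top-k selection, gradients flow through the router end-to-end, and the number of active experts can vary per token. Chunk-level sparsity, by contrast, depends on how much the active-expert sets of consecutive tokens overlap, which is what the paper's CLS-aware training objectives target.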