SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

📅 2025-10-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing token-level Mixture-of-Experts (MoE) models suffer from semantic contamination across experts, imbalanced expert load, and capacity bottlenecks due to routing entire tokens holistically. This paper proposes SliceMoE, which partitions hidden vectors into contiguous slices and routes each slice independently to experts, enabling finer-grained, more balanced model scaling. We introduce slice-level capacity loss and cross-slice dropout to encourage interpretable expert specialization in semantic versus syntactic capabilities. A lightweight shared router predicts top-k experts per slice, and fused batched GEMM operations optimize computation. Experiments on language modeling, machine translation, and text classification show that SliceMoE achieves 1.7× faster inference than dense baselines and reduces perplexity by 12–18% compared to parameter-matched token-level MoE, while significantly improving expert load balance.


๐Ÿ“ Abstract
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
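The abstract's routing mechanism (partition a d-dimensional hidden vector into S contiguous slices, have a lightweight shared router pick top-k experts per slice, apply experts independently, then reassemble) can be sketched as follows. This is a minimal illustrative NumPy sketch under assumed details: the router weights, expert shapes, and gating normalization are not specified in the paper and are modeled here on standard MoE practice; the actual system uses fused batched GEMM kernels rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)

d, S, E, k = 16, 4, 8, 2             # hidden size, slices, experts, top-k (illustrative)
slice_dim = d // S

# Shared lightweight router: a single weight matrix applied to every slice.
W_router = rng.standard_normal((slice_dim, E)) * 0.1
# Each expert here is a single linear map on a slice (a stand-in for the
# feed-forward expert described in the paper).
W_experts = rng.standard_normal((E, slice_dim, slice_dim)) * 0.1

def slice_moe(x):
    """Route each slice of each token's hidden vector to its top-k experts."""
    T = x.shape[0]                       # number of tokens
    slices = x.reshape(T, S, slice_dim)  # partition into contiguous slices
    logits = slices @ W_router           # (T, S, E) router scores per slice
    topk = np.argsort(logits, axis=-1)[..., -k:]      # top-k expert ids
    sel = np.take_along_axis(logits, topk, axis=-1)   # softmax over chosen experts
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(slices)
    for j in range(k):
        e = topk[..., j]                 # (T, S) chosen expert per slice
        y = np.einsum('tsd,tsdo->tso', slices, W_experts[e])
        out += gates[..., j:j+1] * y     # gate-weighted expert outputs
    return out.reshape(T, d)             # reassemble slices into full vectors

y = slice_moe(rng.standard_normal((5, d)))
print(y.shape)  # (5, 16)
```

Because slices from many tokens interleave inside each expert, a single "hard" token cannot monopolize an expert's capacity, which is the intuition behind the smoother utilization claimed in the abstract.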
Problem

Research questions and friction points this paper is trying to address.

Addresses token-level routing bottlenecks in MoE transformers
Enhances expert specialization and load balancing efficiency
Optimizes computational performance while maintaining model quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Routes embedding slices instead of tokens
Uses lightweight shared router per slice
Employs slice-level capacity loss and cross-slice dropout
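The slice-level capacity loss named above is not spelled out in this summary; as a hedged sketch, it could follow Switch-style load balancing applied at slice granularity, penalizing the product of each expert's dispatch fraction and mean routing probability (the exact form and name `slice_balance_loss` are assumptions, not from the paper):

```python
import numpy as np

def slice_balance_loss(gates, topk, num_experts):
    """Auxiliary loss pushing slice-to-expert assignments toward uniformity.

    gates: (T, S, k) routing weights per slice; topk: (T, S, k) expert ids.
    Minimized (value 1.0) when slices are dispatched uniformly across experts.
    """
    T, S, k = gates.shape
    # Fraction of slice assignments dispatched to each expert.
    counts = np.zeros(num_experts)
    np.add.at(counts, topk.ravel(), 1.0)
    frac = counts / topk.size
    # Mean routing probability mass each expert receives.
    prob = np.zeros(num_experts)
    np.add.at(prob, topk.ravel(), gates.ravel())
    prob /= (T * S)
    return num_experts * float(frac @ prob)
```

Under this form, a perfectly uniform assignment yields a loss of 1.0, and any skew toward a subset of experts raises it, giving the trainer a differentiable-in-gates signal to balance slice traffic.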