SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

📅 2025-10-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing token-level Mixture-of-Experts (MoE) models suffer from semantic contamination across experts, imbalanced expert load, and capacity bottlenecks due to routing entire tokens holistically. This paper proposes SliceMoE, which partitions hidden vectors into contiguous slices and routes each slice independently to experts, enabling finer-grained, more balanced model scaling. We introduce slice-level capacity loss and cross-slice dropout to encourage interpretable expert specialization in semantic versus syntactic capabilities. A lightweight shared router predicts top-k experts per slice, and fused batched GEMM operations optimize computation. Experiments on language modeling, machine translation, and text classification show that SliceMoE achieves 1.7× faster inference than dense baselines and reduces perplexity by 12–18% compared to parameter-matched token-level MoE, while significantly improving expert load balance.


๐Ÿ“ Abstract
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
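The abstract's routing mechanism (partition a d-dimensional hidden vector into S contiguous slices, have a lightweight shared router pick top-k experts per slice, apply experts independently, then reassemble) can be sketched as follows. This is a minimal illustrative NumPy sketch under assumed details: the router weights, expert shapes, and gating normalization are not specified in the paper and are modeled here on standard MoE practice; the actual system uses fused batched GEMM kernels rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)

d, S, E, k = 16, 4, 8, 2             # hidden size, slices, experts, top-k (illustrative)
slice_dim = d // S

# Shared lightweight router: a single weight matrix applied to every slice.
W_router = rng.standard_normal((slice_dim, E)) * 0.1
# Each expert here is a single linear map on a slice (a stand-in for the
# feed-forward expert described in the paper).
W_experts = rng.standard_normal((E, slice_dim, slice_dim)) * 0.1

def slice_moe(x):
    """Route each slice of each token's hidden vector to its top-k experts."""
    T = x.shape[0]                       # number of tokens
    slices = x.reshape(T, S, slice_dim)  # partition into contiguous slices
    logits = slices @ W_router           # (T, S, E) router scores per slice
    topk = np.argsort(logits, axis=-1)[..., -k:]      # top-k expert ids
    sel = np.take_along_axis(logits, topk, axis=-1)   # softmax over chosen experts
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(slices)
    for j in range(k):
        e = topk[..., j]                 # (T, S) chosen expert per slice
        y = np.einsum('tsd,tsdo->tso', slices, W_experts[e])
        out += gates[..., j:j+1] * y     # gate-weighted expert outputs
    return out.reshape(T, d)             # reassemble slices into full vectors

y = slice_moe(rng.standard_normal((5, d)))
print(y.shape)  # (5, 16)
```

Because slices from many tokens interleave inside each expert, a single "hard" token cannot monopolize an expert's capacity, which is the intuition behind the smoother utilization claimed in the abstract.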
Problem

Research questions and friction points this paper is trying to address.

Addresses token-level routing bottlenecks in MoE transformers
Enhances expert specialization and load balancing efficiency
Optimizes computational performance while maintaining model quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Routes embedding slices instead of tokens
Uses lightweight shared router per slice
Employs slice-level capacity loss and cross-slice dropout
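The slice-level capacity loss named above is not spelled out in this summary; as a hedged sketch, it could follow Switch-style load balancing applied at slice granularity, penalizing the product of each expert's dispatch fraction and mean routing probability (the exact form and name `slice_balance_loss` are assumptions, not from the paper):

```python
import numpy as np

def slice_balance_loss(gates, topk, num_experts):
    """Auxiliary loss pushing slice-to-expert assignments toward uniformity.

    gates: (T, S, k) routing weights per slice; topk: (T, S, k) expert ids.
    Minimized (value 1.0) when slices are dispatched uniformly across experts.
    """
    T, S, k = gates.shape
    # Fraction of slice assignments dispatched to each expert.
    counts = np.zeros(num_experts)
    np.add.at(counts, topk.ravel(), 1.0)
    frac = counts / topk.size
    # Mean routing probability mass each expert receives.
    prob = np.zeros(num_experts)
    np.add.at(prob, topk.ravel(), gates.ravel())
    prob /= (T * S)
    return num_experts * float(frac @ prob)
```

Under this form, a perfectly uniform assignment yields a loss of 1.0, and any skew toward a subset of experts raises it, giving the trainer a differentiable-in-gates signal to balance slice traffic.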