🤖 AI Summary
This work addresses a limitation of existing single-dimension fine-grained experts: performance saturates once the intermediate dimension is partitioned beyond a certain granularity, capping further gains. To overcome this, the paper introduces the first dual-dimensional fine-grained expert architecture, which partitions experts along both the intermediate and output dimensions. It proposes a two-level sparse feedforward computation scheme and a dedicated routing mechanism to enhance expert specialization, alongside an efficient model-upcycling strategy that enables low-cost construction. Evaluated across ten standard benchmarks, the approach substantially outperforms the strongest baseline, achieving a 6× improvement in parameter efficiency, a 281× reduction in prefill latency, and a 136× increase in decoding throughput.
📝 Abstract
As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance of FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
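To make the bi-level sparsity idea concrete, here is a minimal numpy sketch of one plausible reading of the abstract: an FFN whose intermediate dimension and output dimension are each partitioned into slices, with two routers that each activate a top-k subset of slices per token. All names, sizes, the slice-wise routing scheme, and the use of separate softmax-weighted routers per dimension are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
d_model, d_inter, d_out = 8, 16, 8
n_inter_experts, n_out_experts = 4, 4   # slice counts along each dimension
k_inter, k_out = 2, 2                   # active slices per dimension

# One shared FFN, sliced column-wise along the intermediate (W1) and
# output (W2) dimensions to form fine-grained experts in each dimension.
W1 = rng.standard_normal((d_model, d_inter)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_inter, d_out)) / np.sqrt(d_inter)
router_inter = rng.standard_normal((d_model, n_inter_experts))
router_out = rng.standard_normal((d_model, n_out_experts))

def top_k(scores, k):
    """Indices of the k largest scores plus their softmax weights."""
    idx = np.argsort(scores)[-k:]
    w = np.exp(scores[idx] - scores[idx].max())
    return idx, w / w.sum()

def bi_level_sparse_ffn(x):
    # Level 1: route over slices of the intermediate dimension;
    # only the selected columns of W1 are computed.
    i_idx, i_w = top_k(x @ router_inter, k_inter)
    slice_i = d_inter // n_inter_experts
    h = np.zeros(d_inter)
    for e, w in zip(i_idx, i_w):
        s = slice(e * slice_i, (e + 1) * slice_i)
        h[s] = w * np.maximum(x @ W1[:, s], 0.0)  # ReLU on active slice only
    # Level 2: route over slices of the output dimension;
    # inactive output slices stay exactly zero.
    o_idx, o_w = top_k(x @ router_out, k_out)
    slice_o = d_out // n_out_experts
    y = np.zeros(d_out)
    for e, w in zip(o_idx, o_w):
        s = slice(e * slice_o, (e + 1) * slice_o)
        y[s] = w * (h @ W2[:, s])
    return y

x = rng.standard_normal(d_model)
y = bi_level_sparse_ffn(x)   # y has shape (d_out,); most slices are zero
```

The point of the sketch is the cost structure: both matrix multiplies touch only the routed slices, so compute scales with `k_inter`/`k_out` rather than the full dimensions, which is the kind of two-level sparsity the prefill-latency and decoding-throughput gains would rely on.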