🤖 AI Summary
This work addresses the low utilization of activated expert-layer parameters in Mixture-of-Experts (MoE) models during inference. We propose a neuron-granularity expert selection mechanism: within each expert, neurons are ranked by the magnitude of their gate-projection activations, and during pretraining a top-k strategy dynamically activates only the most responsive neurons per expert, without introducing additional routing parameters or inter-expert communication. This constitutes the first fine-grained, zero-overhead neuron-level sparsification for MoE models, significantly improving parameter efficiency and inference throughput. Experiments demonstrate that activating only 50% of the MoE-layer parameters achieves performance on par with the full-parameter MoE baseline; moreover, under identical activation budgets, our method consistently outperforms prior approaches, achieving both high accuracy and low latency.
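The per-expert top-k mechanism can be sketched in a few lines. The sketch below assumes a SwiGLU-style expert (gate, up, and down projections), which is common in recent MoE models but not stated here; the function name and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def mone_expert_forward(x, w_gate, w_up, w_down, k):
    """Neuron-granular top-k expert forward (MoNE-style sketch, hypothetical API).

    x: (d_model,) one token; w_gate, w_up: (d_ff, d_model); w_down: (d_model, d_ff).
    Only the k neurons with the largest |gate-projection activation| are computed,
    so no extra router parameters or cross-expert communication are needed.
    """
    gate = w_gate @ x                            # (d_ff,) gate-projection activations
    idx = np.argsort(-np.abs(gate))[:k]          # indices of the k most responsive neurons
    act = gate[idx] / (1.0 + np.exp(-gate[idx])) # SiLU applied only to selected neurons
    hidden = act * (w_up[idx] @ x)               # GLU: gated elementwise product, (k,)
    return w_down[:, idx] @ hidden               # project back using only k columns
```

With k equal to the full hidden width the sketch reduces exactly to the dense expert, which is a convenient sanity check; smaller k activates only the chosen fraction of the expert's parameters.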
📝 Abstract
In this work, we first investigate whether the parameters activated by MoE layers are still used sparsely at inference. We conduct a sparsification study on several representative MoE models: for each expert, we rank parameters by the magnitude of their gate-projection activations and progressively prune the activated subset. Pruning up to 60% of the parameters in that subset causes only negligible task-performance degradation; substantial drops occur only after more than 90% are removed. We further decompose experts into neuron-granular units and visualize their activation values, finding that most neuron activations are near zero. This observation motivates selecting only high-activation neurons during pretraining. Based on this insight, we propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert selection by applying only a simple top-k selection within each expert, incurs negligible latency, and requires no additional routing parameters or inter-expert communication. Extensive experiments demonstrate that MoNE matches traditional MoE performance while activating only 50% of the MoE-layer parameters, and it consistently outperforms traditional MoE at equal numbers of activated parameters. These results suggest that MoNE is a practical approach to improving parameter utilization and inference efficiency in MoE-like models.
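The sparsification study described above can be mimicked with a minimal sketch: rank an expert's neurons by gate-activation magnitude, drop the lowest-ranked fraction, and measure how far the expert's output drifts from the dense computation. The SwiGLU expert shape and all names are assumptions for illustration; with random weights the numbers only exercise the bookkeeping and will not reproduce the paper's 60%/90% findings, which hold for trained models.

```python
import numpy as np

def pruned_expert_output(x, w_gate, w_up, w_down, prune_frac):
    """Keep only the (1 - prune_frac) fraction of neurons with the largest
    |gate-projection activation|; the pruned neurons contribute nothing."""
    gate = w_gate @ x
    n_keep = max(1, int(round(gate.shape[0] * (1.0 - prune_frac))))
    keep = np.argsort(-np.abs(gate))[:n_keep]
    act = gate[keep] / (1.0 + np.exp(-gate[keep]))   # SiLU on kept neurons only
    return w_down[:, keep] @ (act * (w_up[keep] @ x))

# Sweep pruning fractions and report relative output drift vs. the full expert.
rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
x = rng.standard_normal(d_model)
w_gate, w_up = rng.standard_normal((2, d_ff, d_model))
w_down = rng.standard_normal((d_model, d_ff))
y_full = pruned_expert_output(x, w_gate, w_up, w_down, 0.0)
for frac in (0.3, 0.6, 0.9):
    y = pruned_expert_output(x, w_gate, w_up, w_down, frac)
    rel_err = np.linalg.norm(y - y_full) / np.linalg.norm(y_full)
    print(f"prune {frac:.0%}: relative output error {rel_err:.3f}")
```

Repeating such a sweep over a trained model's experts, averaged over real tokens and scored on downstream tasks rather than output drift, gives the pruning curves the study refers to.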