LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts

📅 2026-01-26
📈 Citations: 1
Influential: 1
🤖 AI Summary
It remains unclear whether existing Mixture-of-Experts (MoE) architectures achieve near-optimal accuracy per unit of computation or parameter count. This work systematically analyzes the performance bottlenecks of MoE across diverse deployment scenarios from a hardware-software co-design perspective and proposes LatentMoE, a novel architecture that significantly improves accuracy per FLOP and per parameter. Through large-scale design space exploration—spanning up to 95 billion parameters and trained on 1 trillion tokens—combined with theoretically grounded optimization strategies, LatentMoE consistently outperforms standard MoE across multiple model scales. The effectiveness and scalability of LatentMoE have been validated through its successful deployment in the Nemotron-3 Super and Ultra models, demonstrating substantial gains in both efficiency and performance.

📝 Abstract
Mixture-of-Experts (MoE) architectures have become a central component of many state-of-the-art open-source and proprietary large language models. Despite their widespread adoption, it remains unclear how close existing MoE architectures are to optimal with respect to inference cost, as measured by accuracy per floating-point operation and per parameter. In this work, we revisit MoE design from a hardware-software co-design perspective, grounded in empirical and theoretical considerations. We characterize key performance bottlenecks across diverse deployment regimes, spanning offline high-throughput execution and online, latency-critical inference. Guided by these insights, we introduce LatentMoE, a new model architecture resulting from systematic design exploration and optimized for maximal accuracy per unit of compute. Empirical design space exploration at scales of up to 95B parameters and over a 1T-token training horizon, together with supporting theoretical analysis, shows that LatentMoE consistently outperforms standard MoE architectures in terms of accuracy per FLOP and per parameter. Given its strong performance, the LatentMoE architecture has been adopted by the flagship Nemotron-3 Super and Ultra models and scaled to substantially larger regimes, including longer token horizons and larger model sizes, as reported in Nvidia et al. (arXiv:2512.20856).
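The paper's central metric is accuracy per FLOP and per parameter. To make that concrete, here is a minimal sketch (not from the paper) of how active parameters and per-token FLOPs are typically estimated for a standard top-k MoE feed-forward layer; the function name and the model dimensions are illustrative assumptions, not values reported by the authors.

```python
def moe_ffn_cost(d_model, d_ff, n_experts, top_k):
    """Rough per-token cost of one top-k MoE FFN layer (illustrative sketch).

    Counts only the router and the experts' up/down projections;
    activations, biases, and attention layers are ignored.
    """
    router_params = d_model * n_experts          # linear gate over experts
    expert_params = 2 * d_model * d_ff           # one expert: up + down projection
    total_params = router_params + n_experts * expert_params
    # Only the router and the top_k selected experts run per token.
    active_params = router_params + top_k * expert_params
    flops_per_token = 2 * active_params          # ~2 FLOPs per weight (multiply-add)
    return total_params, active_params, flops_per_token

# Hypothetical dimensions for illustration only.
total, active, flops = moe_ffn_cost(d_model=4096, d_ff=14336, n_experts=64, top_k=2)
```

The gap between `total` and `active` is what makes "accuracy per parameter" and "accuracy per FLOP" diverge for MoE models, which is exactly the trade-off space the paper's design exploration navigates.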
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
accuracy per FLOP
accuracy per parameter
inference cost
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
hardware-software co-design
accuracy per FLOP
LatentMoE
efficient inference