🤖 AI Summary
This work targets efficiency and scalability in large language models for agentic reasoning by introducing Nemotron 3 Super, an open-source model with 120B total parameters (12B active). The architecture combines a hybrid Mamba-Attention design with LatentMoE, a new Mixture-of-Experts mechanism that optimizes both accuracy per FLOP and accuracy per parameter. It is the first model in the Nemotron 3 family to be pre-trained in NVFP4. After pre-training on 25 trillion tokens, it is post-trained with supervised fine-tuning and reinforcement learning. Integrated MTP layers enable native speculative decoding, and the model supports context lengths of up to one million tokens. The reported results show comparable accuracy on common benchmarks with up to 2.2× and 7.5× higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively. The datasets and the base, post-trained, and quantized checkpoints are publicly released on HuggingFace.
📝 Abstract
We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a hybrid Mamba-Attention Mixture-of-Experts model with 120 billion total parameters (12 billion active). Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) use LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include Multi-Token Prediction (MTP) layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens, then post-trained it using supervised fine-tuning (SFT) and reinforcement learning (RL). The final model supports context lengths of up to 1M tokens and achieves comparable accuracy on common benchmarks, with up to 2.2x and 7.5x higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively. The Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.