Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

📅 2026-04-14

🤖 AI Summary
This work addresses the efficiency and scalability of large language models for agentic reasoning by introducing Nemotron 3 Super, an open 120B-parameter model with 12B active parameters. The architecture combines a hybrid Mamba-Attention design with LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter. It is the first model in the Nemotron 3 family to be pre-trained in NVFP4; it was pre-trained on 25 trillion tokens and then post-trained with supervised fine-tuning and reinforcement learning. Integrated MTP layers enable native speculative decoding for faster inference, and the final model supports context lengths of up to 1M tokens. The authors report comparable accuracy on common benchmarks, with up to 2.2× and 7.5× higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively. The Nemotron 3 Super datasets and the base, post-trained, and quantized checkpoints are open-sourced on HuggingFace.

📝 Abstract
We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
agentic reasoning
inference throughput
hybrid architecture
large language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

LatentMoE
MTP layers
NVFP4 pre-training
Mamba-Attention hybrid
speculative decoding
Aakshita Chandiramani
Aaron Blakeman
Abdullahi Olaoye
Abhibha Gupta
University of Pittsburgh
Natural language processing
Abhilash Somasamudramath
Abhinav Khattar
NVIDIA
Machine Learning, Natural Language Processing, Deep Learning
Adeola Adesoba
Adi Renduchintala
Adil Asif
Aditya Agrawal
Aditya Vavre
University of Texas at Austin
Natural Language Processing
Ahmad Kiswani
Aishwarya Padmakumar
NVIDIA
Natural Language Processing, Reinforcement Learning, Robotics
Ajay Hotchandani
Akanksha Shukla
Akhiad Bercovich
PhD candidate, Weizmann Institute of Science
Single Cell Genomics, Epigenomics, Machine Learning, DNA language/regulation models, efficient LLMs
Aleksander Ficek
Aleksandr Shaposhnikov
Alex Gronskiy
Alex Kondratenko
Alex Neefus
Alex Steiner
Alex Yang
Georgia Institute of Technology
Human-Computer Interaction, Data Visualization, VR/AR