Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To jointly achieve long-context modeling and efficient inference for agentic reasoning, this paper introduces Nemotron 3 Nano 30B-A3B, a sparse Mixture-of-Experts (MoE) language model built on a hybrid Mamba-Transformer architecture. It pairs Mamba's linear-time sequence mixing, which scales to very long inputs, with Transformer attention layers for strong reasoning, and supports context lengths up to 1M tokens. Sparse expert routing activates less than half as many parameters per forward pass as the previous-generation Nemotron 2 Nano while improving accuracy. The model is pretrained on 25 trillion tokens and then post-trained with multi-stage supervised fine-tuning and large-scale reinforcement learning on diverse environments. It outperforms similarly sized open models, including GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, on popular benchmarks while delivering up to 3.3x higher inference throughput, and both the base and post-trained checkpoints are released on Hugging Face.
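The sparse expert routing mentioned above can be pictured with a minimal top-k MoE router. The sketch below is a generic PyTorch illustration assuming a standard top-k softmax router; the function name, sizes, and the choice of k=2 are hypothetical and are not taken from the Nemotron 3 Nano implementation.

```python
# Minimal sketch of sparse top-k Mixture-of-Experts routing (illustrative only;
# names, sizes, and k are assumptions, not the Nemotron 3 Nano implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def topk_moe_forward(x, router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x       : (tokens, d_model) token representations
    router  : linear layer producing one logit per expert
    experts : list of per-expert feed-forward modules
    k       : number of experts activated per token (sparse activation)
    """
    logits = router(x)                            # (tokens, n_experts)
    weights, idx = torch.topk(logits, k, dim=-1)  # keep only the top-k experts per token
    weights = F.softmax(weights, dim=-1)          # renormalize over the selected experts
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e              # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out


if __name__ == "__main__":
    d_model, n_experts, tokens = 64, 8, 16
    router = nn.Linear(d_model, n_experts, bias=False)
    experts = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
               for _ in range(n_experts)]
    x = torch.randn(tokens, d_model)
    print(topk_moe_forward(x, router, experts, k=2).shape)  # torch.Size([16, 64])
```

Each token only runs its k selected experts, which is how a 30B-A3B-style MoE model can keep a large total parameter count while activating only a small subset per forward pass.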

📝 Abstract
We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

How to jointly achieve long-context modeling (up to 1M tokens) and efficient inference for agentic reasoning
How to raise inference throughput over similarly sized open models without sacrificing accuracy on popular benchmarks
How to improve agentic, reasoning, and chat abilities relative to the previous-generation Nemotron 2 Nano
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer architecture with sparse Mixture-of-Experts layers (see the layout sketch after this list)
Pretraining on 25 trillion tokens followed by supervised fine-tuning and large-scale RL on diverse environments
Up to 3.3x higher inference throughput than similarly sized open models, with only a fraction of parameters active per forward pass
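As the first bullet notes, the hybrid design interleaves linear-time Mamba sequence-mixing layers with occasional attention layers, each paired with a sparse MoE feed-forward block. The toy layout generator below is a hedged sketch of that idea; the layer count, the one-attention-layer-every-four pattern, and the module names are assumptions for illustration, not the published Nemotron 3 Nano configuration.

```python
# Toy sketch of a hybrid Mamba-Transformer block layout. The pattern, layer
# count, and module choices are assumptions for illustration; they do not
# reproduce the Nemotron 3 Nano architecture.
from dataclasses import dataclass


@dataclass
class BlockSpec:
    mixer: str  # "mamba" (linear-time sequence mixing) or "attention"
    mlp: str    # "moe" (sparse expert feed-forward) or "dense"


def hybrid_layout(n_layers=12, attention_every=4):
    """Mostly Mamba mixers, with a full-attention layer every few blocks,
    each followed by a sparse MoE feed-forward layer."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attention_every == 0 else "mamba"
        layout.append(BlockSpec(mixer=mixer, mlp="moe"))
    return layout


if __name__ == "__main__":
    for i, spec in enumerate(hybrid_layout()):
        print(f"layer {i:2d}: {spec.mixer:9s} + {spec.mlp}")
```

Running it prints a stack in which most blocks use the Mamba mixer and only a few use attention, which is the usual way hybrid designs trade attention's quadratic cost for linear-time scaling on long contexts.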