🤖 AI Summary
To address the trade-off between throughput and accuracy in long chain-of-thought reasoning tasks, this paper introduces Nemotron-Nano-9B-v2, a language model based on a hybrid Mamba-Transformer architecture. Its core design replaces most self-attention layers with computationally efficient Mamba-2 layers, and combines FP8-precision pretraining with Minitron-based compression and knowledge distillation to shrink a 12B-parameter base model to 9B parameters without sacrificing accuracy. The resulting model supports 128K-token context windows on a single NVIDIA A10G GPU. Experiments show that Nemotron-Nano-9B-v2 matches or exceeds the accuracy of similarly sized models (e.g., Qwen3-8B) while achieving up to 6× higher inference throughput in reasoning settings such as 8K input and 16K output tokens. This work advances deployable language models that combine high accuracy, high throughput, and efficient memory use.
📝 Abstract
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
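The key architectural idea above is that only a small fraction of layers are self-attention, with the rest being Mamba-2 blocks. The following is a minimal sketch of how such a hybrid layer pattern might be constructed; the layer count, number of attention layers, and even-spacing rule here are hypothetical placeholders, not the paper's actual configuration:

```python
def build_hybrid_pattern(num_layers: int = 56, num_attention: int = 4) -> list[str]:
    """Return a per-layer type list for a hybrid Mamba-Transformer stack.

    Most layers are Mamba-2 blocks; a few self-attention layers are
    spread evenly through the stack (placement rule is illustrative).
    """
    # Evenly spaced attention positions; all other layers are Mamba-2.
    step = num_layers // (num_attention + 1)
    attn_positions = {step * (i + 1) for i in range(num_attention)}
    return ["attention" if i in attn_positions else "mamba2"
            for i in range(num_layers)]

pattern = build_hybrid_pattern()
print(pattern.count("mamba2"), pattern.count("attention"))  # e.g. 52 4
```

Because Mamba-2 layers keep a fixed-size recurrent state instead of a growing key-value cache, a stack dominated by them generates long reasoning traces with far less memory traffic than a pure-Transformer stack, which is the source of the throughput gains reported above.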