Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

📅 2025-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To balance inference efficiency and accuracy in large language models, this paper proposes a hybrid Mamba-Transformer architecture: Mamba layers, which require constant computation and memory per generated token, replace most of the self-attention layers in the standard Transformer stack, yielding models at the 8B and 56B/47B scales. The authors introduce MiniPuzzle, a compression technique that combines structured pruning with knowledge distillation, and use it to derive the 47B model from the 56B one. They also design a stable FP8 training recipe that matches BF16-level results. The models are supported in the Hugging Face, NeMo, and Megatron-LM ecosystems. Experiments show the hybrid models are up to 3× faster at inference than similarly sized Transformer models while matching or exceeding Qwen-2.5 and Llama-3.1 on multiple benchmarks; the 47B model retains the 56B model's accuracy while being 20% faster to infer.
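The layer pattern the summary describes can be sketched concretely. Below is a minimal, hypothetical PyTorch stack in which most layers are Mamba-style SSM layers with a few self-attention layers interleaved; the placeholder modules and the 1-in-6 attention ratio are illustrative assumptions, not Nemotron-H's actual configuration.

```python
import torch
import torch.nn as nn

class PlaceholderMamba(nn.Module):
    """Stand-in for a Mamba-2 selective-SSM layer (constant state per
    generated token at inference). Real implementations live in
    libraries such as mamba-ssm; a linear layer keeps this runnable."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return self.proj(x)

class PlaceholderAttention(nn.Module):
    """Stand-in for a self-attention layer (causal masking omitted)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class HybridStack(nn.Module):
    """Mostly-Mamba stack with occasional attention layers, as the
    summary describes. The 1-in-6 ratio here is illustrative only."""
    def __init__(self, dim=512, num_layers=12, attn_every=6):
        super().__init__()
        self.layers = nn.ModuleList(
            PlaceholderAttention(dim) if (i + 1) % attn_every == 0
            else PlaceholderMamba(dim)
            for i in range(num_layers)
        )
    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connections; norms omitted
        return x

x = torch.randn(1, 16, 512)
print(HybridStack()(x).shape)  # torch.Size([1, 16, 512])
```

Keeping a handful of attention layers is a common design choice in hybrid stacks: they preserve exact token-to-token retrieval, while the Mamba layers carry the bulk of the depth at constant per-token cost.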

📝 Abstract
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.
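The abstract's efficiency argument comes down to generation-time state: attention layers keep a KV cache that grows with every generated token, while Mamba layers carry a fixed-size recurrent state. A back-of-the-envelope comparison, with all dimensions hypothetical rather than Nemotron-H's real configuration:

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes held in a Transformer KV cache for one sequence: two tensors
    (K and V) per attention layer, per KV head, per generated position."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def ssm_state_bytes(n_mamba_layers, n_heads, head_dim, state_dim, dtype_bytes=2):
    """Bytes of Mamba-style recurrent state: fixed size, independent of
    sequence length (per layer roughly heads * head_dim * state_dim)."""
    return n_mamba_layers * n_heads * head_dim * state_dim * dtype_bytes

# Hypothetical pure-Transformer vs mostly-Mamba hybrid of similar depth.
for T in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(T, n_attn_layers=32, n_kv_heads=8, head_dim=128)
    hybrid = (ssm_state_bytes(n_mamba_layers=28, n_heads=32,
                              head_dim=64, state_dim=128)
              + kv_cache_bytes(T, n_attn_layers=4, n_kv_heads=8, head_dim=128))
    print(f"T={T:>7,}: pure Transformer {kv/2**20:9.1f} MiB"
          f" | hybrid {hybrid/2**20:9.1f} MiB")
```

With these made-up numbers, the hybrid's per-sequence state at 131K tokens is dominated by its four remaining attention layers, which is why replacing most (not necessarily all) attention layers already yields large memory and speed savings.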
Problem

Research questions and friction points this paper is trying to address.

Reducing inference cost while maintaining model accuracy
Replacing most self-attention layers with efficient Mamba layers without degrading quality
Cutting inference-time memory and latency further via model compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer reduces inference cost
MiniPuzzle compression (structured pruning plus distillation) improves inference speed and memory footprint
Stable FP8 training recipe matches BF16 accuracy (illustrative sketches of both techniques follow this list)
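The report gives MiniPuzzle's specifics; as a hedged illustration of the prune-then-distill pattern it belongs to, the sketch below shows only the standard KL distillation objective a pruned student would be trained against. This is a generic recipe, not MiniPuzzle's actual procedure, and the temperature and toy shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student token distributions,
    the standard distillation objective used after pruning."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.log_softmax(teacher_logits / t, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (t * t)

# Toy usage: a pruned "student" recovering the "teacher" distribution.
teacher = torch.randn(4, 32000)               # (batch, vocab) logits
student = torch.randn(4, 32000, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```

For the FP8 bullet, the core mechanic shared by FP8 training schemes is scaling tensors into the narrow E4M3 range before casting. Below is a minimal simulation using PyTorch's native float8 dtype; production recipes (e.g., Transformer Engine's delayed scaling) add amax-history tracking and keep higher-precision master weights, and the paper's exact recipe may differ.

```python
import torch

def fp8_quantize_e4m3(x):
    """Per-tensor scaled cast into FP8 E4M3 (max representable ~448):
    map the tensor's amax near the format maximum, then cast."""
    scale = 448.0 / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

x = torch.randn(256, 256)
x8, scale = fp8_quantize_e4m3(x)
x_restored = x8.to(torch.float32) / scale     # dequantize
print((x - x_restored).abs().max().item())    # small round-off error
```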
👥 Authors (NVIDIA)
Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adi Renduchintala, A. Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Mahabaleshwarkar, Andrew Tao, Anna C. Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, B. Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, I. Karmanov, I. Moshkov, Izik Golan, Jan Kautz, J. Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, J. Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, K. Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence C. McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl, Marta M. Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi, M. Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Michael Ranzinger, Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, M. Shoeybi, M. Patwary, Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau, O. Nitski, Parth Chadha, Pasha Shamis, P. Micikevicius, Pavlo Molchanov, Peter Dykas, Philipp Fischer, P. Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, R. Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, S. Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong Bak, S. Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawomir Kierat, Somshubra Majumdar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri, Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman, Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, V. Noroozi, Varun Singh, V. Korthikanti, V. Kurin, W. Ahmad, Wei Du, Wei Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin, Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zijia Chen