🤖 AI Summary
To balance inference efficiency and accuracy in large language models, this paper proposes a hybrid Mamba-Transformer architecture: Mamba layers, which require constant computation and constant memory per generated token, replace most of the self-attention layers in the standard Transformer architecture, yielding models at the 8B and 56B/47B scale. The paper introduces MiniPuzzle, a compression technique combining structured pruning with knowledge distillation, to produce a lighter model, and presents a stable FP8 training recipe that matches BF16-level results. The models are supported in the Hugging Face, NeMo, and Megatron-LM ecosystems. Experiments show the hybrid models achieve up to 3× faster inference than comparably sized Transformer models at better or on-par accuracy, matching or exceeding Qwen-2.5 and Llama-3.1 on multiple benchmarks, while the pruned 47B model is 20% faster than the 56B model with similar accuracy.
📝 Abstract
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression-via-pruning-and-distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on-par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.