🤖 AI Summary
To address industry demand for efficient inference and memory optimization, this work introduces an open-source series of heterogeneous reasoning models (8B/49B/253B) featuring a training recipe that yields models able to switch between reasoning and standard chat modes at runtime. Methodologically, the work combines neural architecture search (starting from Llama 3), knowledge distillation, continued pretraining, supervised fine-tuning, and large-scale reinforcement learning. Key contributions: (1) the first commercially licensed, fully open-sourced reasoning models, accompanied by the complete post-training dataset and training code; (2) reasoning capability competitive with DeepSeek-R1 at substantially higher throughput and lower memory footprint; and (3) strong reasoning performance and inference efficiency unified in a single model via on-demand, runtime mode switching. All models, datasets, and training frameworks are publicly released under permissive open-source licenses.
📝 Abstract
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large-scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources:

1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement.
2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset.
3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
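The dynamic reasoning toggle described above is controlled through the system prompt rather than a separate model variant. A minimal sketch of how a client might assemble such a request, assuming the `"detailed thinking on"`/`"detailed thinking off"` system-prompt convention used in NVIDIA's model cards (the exact toggle string is an assumption here, not taken from this abstract):

```python
# Sketch of the per-request reasoning toggle for Llama-Nemotron models.
# Assumption: mode is selected via the system prompt strings
# "detailed thinking on" / "detailed thinking off" (NVIDIA model-card
# convention); the abstract itself does not specify the mechanism.

def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Assemble a chat request whose system prompt selects the mode."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Same weights, two behaviours: reasoning mode emits a long chain of
# thought before answering, chat mode responds directly.
reasoning_request = build_messages("Prove that sqrt(2) is irrational.", reasoning=True)
chat_request = build_messages("What is the capital of France?", reasoning=False)
```

Because the toggle is just a system-prompt switch, a single deployment can serve both latency-sensitive chat traffic and reasoning-heavy queries without loading two models.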