Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the long-standing trade-off between inference speed and accuracy in large language models (LLMs), this paper introduces Jet-Nemotron, a family of hybrid-attention LLMs built with Post Neural Architecture Search (PostNAS). The method freezes the MLP weights of a pre-trained full-attention model and runs an efficient, targeted search over the attention structure alone. The pipeline covers four steps: optimizing full-attention layer placement, selecting among linear attention blocks, designing a new attention block, and hardware-aware hyperparameter tuning. While preserving accuracy, Jet-Nemotron-2B matches or surpasses larger models, including Qwen3, Llama3.2, and recent MoE models, on benchmarks such as MMLU, and delivers up to a 53.6× generation-throughput speedup and a 6.1× prefill speedup, a significant step toward practical deployment of models that are both accurate and efficient.
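The layer-placement step decides which layers keep full softmax attention and which switch to a linear block. In the paper this placement is learned via a search over the attention structure; as a simplified stand-in, a greedy rule over hypothetical per-layer importance scores looks like this:

```python
def place_full_attention(importance, budget):
    """Keep full (softmax) attention at the `budget` most important layers;
    replace the rest with linear attention.

    `importance` is a hypothetical per-layer score, e.g. the accuracy drop
    observed when that layer alone is swapped to linear attention. The real
    PostNAS pipeline learns placement rather than applying a fixed heuristic.
    """
    ranked = sorted(range(len(importance)), key=importance.__getitem__, reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(importance))]
```

For example, `place_full_attention([0.1, 0.9, 0.3, 0.05], budget=2)` keeps full attention only at layers 1 and 2, the two layers whose removal would hurt accuracy most under this toy scoring.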

📝 Abstract
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
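Component (2) selects among existing linear attention blocks. As a minimal illustration of why such blocks improve generation throughput, here is generic causal linear attention with the elu+1 feature map of Katharopoulos et al. (2020): the running state has fixed size, so per-token cost is constant in sequence length instead of growing with a KV cache. This is not Jet-Nemotron's own attention block design, only the basic mechanism the family of candidates shares:

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention for a single head; q, k, v are (n, d) arrays.

    Maintains a fixed-size running state (kv, z) instead of attending over
    all past keys, giving O(1) memory and compute per generated token.
    """
    def phi(x):  # elu(x) + 1: a positive feature map
        return np.where(x > 0, x + 1.0, np.exp(x))

    q, k = phi(q), phi(k)
    d = q.shape[1]
    kv = np.zeros((d, v.shape[1]))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                  # running sum of phi(k_j), for normalization
    out = np.empty_like(v)
    for i in range(q.shape[0]):
        kv += np.outer(k[i], v[i])
        z += k[i]
        out[i] = q[i] @ kv / (q[i] @ z + 1e-6)
    return out
```

Because the state only accumulates forward, the output at each position depends solely on earlier tokens, which is the causality property full attention enforces with a mask.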
Problem

Research questions and friction points this paper is trying to address.

Improving language model generation throughput while maintaining accuracy
Efficient neural architecture exploration for optimal attention block design
Achieving superior performance with smaller model size compared to competitors
Innovation

Methods, ideas, or system contributions that make the work stand out.

PostNAS pipeline for efficient model design
Freezes MLP weights to explore attention blocks
Hardware-aware hyperparameter search for optimization
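The second bullet, freezing MLP weights so only attention blocks are trained, amounts to partitioning the parameter set before building the optimizer. A framework-agnostic sketch (parameter names here are hypothetical; real checkpoints use model-specific names):

```python
def split_params(named_params, frozen_component="mlp"):
    """Partition parameters into trainable (attention-side) and frozen (MLP).

    `named_params` maps dotted parameter names to tensors/arrays. Only the
    trainable dict would be handed to the optimizer; the frozen MLP weights
    are reused from the pre-trained model, which is what makes the PostNAS
    search cheap relative to training from scratch.
    """
    trainable, frozen = {}, {}
    for name, param in named_params.items():
        bucket = frozen if frozen_component in name.split(".") else trainable
        bucket[name] = param
    return trainable, frozen
```

In a deep-learning framework the same split would typically be expressed by setting the frozen tensors' gradient flags off and passing only the trainable subset to the optimizer.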