🤖 AI Summary
To address the longstanding trade-off between inference speed and accuracy in large language models (LLMs), this paper introduces Jet-Nemotron, a family of hybrid-attention LLMs built with Post Neural Architecture Search (PostNAS). The method starts from a pretrained full-attention model, freezes its MLP weights, and performs an efficient, targeted search exclusively over the attention design. The search spans four components: placement of the remaining full-attention layers, selection of a linear attention block, design of a new attention block, and hardware-aware hyperparameter tuning. While preserving accuracy, Jet-Nemotron-2B matches or surpasses larger models (including Qwen3, Llama3.2, and MoE variants) on comprehensive benchmarks such as MMLU, and delivers up to a 53.6× improvement in generation throughput and a 6.1× speedup in the prefill phase. These results substantially narrow the gap between high accuracy and high efficiency in LLM inference, marking a significant step toward practical deployment of performant LLMs.
📝 Abstract
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale (15B total and 2.2B activated parameters).
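The core idea behind PostNAS, keeping the pretrained MLP weights fixed and searching only over which layers keep full attention versus a cheaper linear attention block, can be illustrated with a toy greedy search. This is a hypothetical simplification for intuition only: the layer names, accuracy gains, and cost numbers below are made up, and the paper's actual pipeline learns layer placement rather than using a hand-written heuristic.

```python
# Illustrative sketch of hardware-budgeted attention-layer placement
# (a hypothetical simplification of the PostNAS idea; all numbers are toy values).
# MLP weights are assumed frozen; only the per-layer attention choice is searched.

FULL, LINEAR = "full_attention", "linear_attention"

def search_attention_placement(accuracy_gain, full_cost, linear_cost, budget):
    """Greedily keep full attention in the layers where it helps accuracy
    the most per unit of extra inference cost, within a total cost budget."""
    n = len(accuracy_gain)
    plan = [LINEAR] * n                  # start from the cheapest all-linear design
    spent = sum(linear_cost)             # baseline cost with every layer linear
    # Extra cost a layer incurs if upgraded from linear to full attention.
    extra = [full_cost[i] - linear_cost[i] for i in range(n)]
    # Rank layers by accuracy gained per unit of extra cost.
    order = sorted(range(n), key=lambda i: accuracy_gain[i] / extra[i], reverse=True)
    for i in order:
        if spent + extra[i] <= budget:   # upgrade only while the budget allows
            plan[i] = FULL
            spent += extra[i]
    return plan, spent

# Toy example: 4 layers; layer 1 benefits most from keeping full attention.
gains = [0.1, 0.9, 0.2, 0.05]
f_cost = [4.0, 4.0, 4.0, 4.0]
l_cost = [1.0, 1.0, 1.0, 1.0]
plan, cost = search_attention_placement(gains, f_cost, l_cost, budget=8.0)
print(plan, cost)  # → ['linear_attention', 'full_attention', 'linear_attention', 'linear_attention'] 7.0
```

The greedy rule here stands in for the learned placement search; the real method also selects among linear attention designs and tunes hyperparameters against measured hardware throughput.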