Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

📅 2025-11-24

🤖 AI Summary

Problem: When small language models (SLMs) are deployed under strict latency constraints, parameter efficiency does not reliably translate into lower latency on real devices. Method: The paper proposes design principles centered on the depth-width ratio and operator selection, and uses an evolutionary search to discover hybrid architectures built from efficient attention alternatives. Training additionally applies a weight normalization technique that enables more effective weight updates and improves final convergence. Contribution/Results: The resulting Nemotron-Flash models achieve over +5.5% average accuracy, 1.3×/1.9× lower latency, and 18.7×/45.6× higher throughput compared to Qwen3-1.7B/0.6B, respectively.

📝 Abstract

Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
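The evolutionary search over operator combinations described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the operator names, the per-layer latency and quality numbers, the diversity bonus in `proxy_score`, and the selection scheme are placeholders, not the paper's search space, measurements, or algorithm.

```python
import random

# Hypothetical operator pool. Latency and quality values are made-up
# placeholders, not measurements from the paper.
OPERATORS = {
    "attention": {"latency": 3.0, "quality": 1.0},
    "linear_attention": {"latency": 1.0, "quality": 0.6},
    "ssm": {"latency": 1.2, "quality": 0.7},
}


def latency(arch):
    """Toy latency model: sum of per-layer operator costs."""
    return sum(OPERATORS[op]["latency"] for op in arch)


def proxy_score(arch):
    """Toy accuracy proxy: summed quality plus a small bonus for mixing operators."""
    return sum(OPERATORS[op]["quality"] for op in arch) + 0.5 * len(set(arch))


def fitness(arch, budget):
    """Architectures over the latency budget are treated as infeasible."""
    return proxy_score(arch) if latency(arch) <= budget else float("-inf")


def mutate(arch, rng):
    """Replace the operator at one random layer."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(list(OPERATORS))
    return tuple(child)


def crossover(a, b, rng):
    """Single-point crossover between two parent architectures."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolutionary_search(n_layers=8, budget=14.0, population_size=20,
                        generations=30, seed=0):
    rng = random.Random(seed)
    names = list(OPERATORS)
    # Seed the population with the cheapest uniform architecture so that
    # at least one feasible candidate always exists.
    cheapest = min(names, key=lambda n: OPERATORS[n]["latency"])
    population = [tuple([cheapest] * n_layers)]
    population += [tuple(rng.choice(names) for _ in range(n_layers))
                   for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda a: fitness(a, budget), reverse=True)
        parents = population[: population_size // 4]  # elitism: keep the top quarter
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=lambda a: fitness(a, budget))


if __name__ == "__main__":
    best = evolutionary_search()
    print(best, latency(best), proxy_score(best))
```

In a real setting the toy `latency` function would be replaced by measurements on the target device and `proxy_score` by a short-training or held-out evaluation signal; both substitutions are outside what this summary specifies.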

Problem

Research questions and friction points this paper is trying to address.

Optimizing small language models for real-device latency rather than parameter count

Identifying architectural factors like depth-width ratios and operator choices affecting latency

Developing hybrid SLMs that advance accuracy-latency trade-off frontiers
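To make the depth-width trade-off concrete, the sketch below enumerates Transformer shapes that share roughly the same parameter budget. It assumes the common approximation of 12 · width² non-embedding parameters per block (four attention projections plus a feed-forward layer with a 4× expansion); the function names and the example budget are illustrative and do not come from the paper.

```python
import math


def approx_block_params(depth, width, ffn_mult=4):
    """Approximate non-embedding parameters of a standard Transformer stack.

    Per block: 4 * width^2 for the Q, K, V, O projections, plus
    2 * ffn_mult * width^2 for the feed-forward up and down projections.
    """
    return depth * (4 + 2 * ffn_mult) * width ** 2


def width_for_budget(depth, budget, ffn_mult=4):
    """Largest integer width whose stack fits within the parameter budget."""
    per_block_factor = 4 + 2 * ffn_mult
    return math.isqrt(budget // (depth * per_block_factor))


if __name__ == "__main__":
    budget = approx_block_params(12, 768)  # example budget, chosen arbitrarily
    for depth in (6, 12, 24, 48):
        width = width_for_budget(depth, budget)
        print(f"depth={depth:3d} width={width:5d} depth/width={depth / width:.4f}")
```

Each row has a similar parameter count, but the deeper shapes execute more layers sequentially, which is why the abstract ties the depth-width ratio to small-batch latency.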

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes depth-width ratios for latency efficiency

Uses evolutionary search for hybrid operator combinations

Applies weight normalization to enhance training convergence
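The summary does not say how the weight normalization is applied. One plausible reading, shown here purely as an assumption, is to project each weight row back to unit L2 norm after every optimizer step; the function names, the row-wise choice, and the plain SGD update are all illustrative and may differ from the paper's method.

```python
import math


def normalize_rows(weight, eps=1e-12):
    """Rescale each row of a weight matrix (list of lists) to unit L2 norm."""
    normalized = []
    for row in weight:
        norm = math.sqrt(sum(w * w for w in row))
        normalized.append([w / max(norm, eps) for w in row])
    return normalized


def sgd_step_with_projection(weight, grad, lr=0.1):
    """Plain SGD update followed by the (assumed) row-wise normalization."""
    updated = [[w - lr * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(weight, grad)]
    return normalize_rows(updated)


if __name__ == "__main__":
    W = [[3.0, 4.0], [1.0, 0.0]]
    G = [[0.0, 0.0], [0.0, 1.0]]
    print(sgd_step_with_projection(W, G))
```

Under this reading, the update changes only the direction of each row and never its magnitude, so the effective step size stays comparable across rows throughout training.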
