🤖 AI Summary
This work addresses two limitations of conventional large language models: the high cost of training a model family, and static architectures that prevent adjusting compute during inference. The authors propose Star Elastic, a post-training method that embeds multiple nested submodels of different sizes within a single parent model using one post-training run. It combines structural nesting along the SSM, embedding-channel, MoE, and FFN dimensions with an end-to-end trainable router, curriculum-based knowledge distillation, and quantization-aware distillation. Applied to Nemotron Nano, Star Elastic reduces training cost by 360× compared to pretraining from scratch and by 7× relative to state-of-the-art compression methods, while its dynamic per-phase model selection yields up to 16% higher accuracy and up to 1.9× lower latency.
📝 Abstract
Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.
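The idea of nesting and zero-shot slicing mentioned in the abstract can be illustrated with a toy example. The sketch below is not the Star Elastic implementation: it assumes a single ReLU feed-forward layer whose hidden units are already ordered by importance, and it skips the trainable router and distillation entirely. The class name `NestedFFN` and the `width` parameter are hypothetical and exist only for illustration. It shows how a smaller submodel can be read out of the parent's weights by taking a prefix along the FFN hidden axis, with no extra copy of the parameters.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


class NestedFFN:
    """Toy feed-forward layer whose submodels reuse a prefix of the hidden units.

    Assumption: hidden units are sorted so that the first `width` units form
    the submodel of that size. How that ordering is learned is out of scope.
    """

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        self.w_in = w_in    # shape: hidden x d_model
        self.w_out = w_out  # shape: d_model x hidden

    def forward(self, x: List[float], width: int) -> List[float]:
        # Slice the shared weights: first `width` rows of w_in,
        # first `width` columns of w_out.
        w_in = self.w_in[:width]
        w_out = [row[:width] for row in self.w_out]
        return matvec(w_out, relu(matvec(w_in, x)))


# Parent layer with d_model = 2 and hidden = 4.
ffn = NestedFFN(
    w_in=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]],
    w_out=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
)
x = [1.0, 2.0]

full_out = ffn.forward(x, width=4)   # parent: uses all 4 hidden units
small_out = ffn.forward(x, width=2)  # nested submodel: first 2 hidden units
print(full_out, small_out)
```

Because both calls read from the same weight arrays, switching between the parent and the nested submodel requires only a different `width`. Under the same assumption, a per-phase scheme like the one the abstract describes would amount to choosing one width for the thinking phase and another for the answering phase.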