Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

📅 2026-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two problems with conventional large language model families: each model size requires its own expensive training run, and static architectures spend a fixed amount of compute at inference regardless of token difficulty. The authors propose Star Elastic, a post-training method that embeds multiple nested submodels within a single parent reasoning model in one post-training job. It combines structural nesting along the SSM, embedding channel, MoE, and FFN axes with an end-to-end trainable router, curriculum-based knowledge distillation, and Quantization-Aware Distillation (QAD). Applied to NVIDIA Nemotron Nano, Star Elastic reports a 360× cost reduction versus pretraining from scratch and a 7× reduction over state-of-the-art compression, and its elastic budget control achieves up to 16% higher accuracy and 1.9× lower latency.
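To make the idea of a submodel nested inside a parent concrete, here is a minimal sketch of width nesting along the FFN axis. It assumes the submodel is a prefix of the parent's hidden units, which is an illustrative simplification and not necessarily how Star Elastic selects units; the function name, weights, and sizes are all hypothetical.

```python
"""Toy illustration of nested FFN widths (hypothetical, not the paper's code)."""


def relu(value):
    return value if value > 0.0 else 0.0


def ffn_forward(x, w_in, w_out, width):
    """Run a two-layer FFN using only the first `width` hidden units.

    x:     list of d_model floats
    w_in:  d_ff rows, one per hidden unit, each holding d_model input weights
    w_out: d_ff rows, one per hidden unit, each holding d_model output weights
    """
    d_model = len(x)
    y = [0.0] * d_model
    # Assumption: the submodel is a prefix of the parent's hidden units,
    # so it shares the parent's weights and needs no extra storage.
    for h in range(width):
        act = relu(sum(w * xi for w, xi in zip(w_in[h], x)))
        for j in range(d_model):
            y[j] += act * w_out[h][j]
    return y


# Hypothetical parent with d_model = 2 and d_ff = 4.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
W_OUT = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]

x = [1.0, 2.0]
small = ffn_forward(x, W_IN, W_OUT, width=2)  # nested submodel
full = ffn_forward(x, W_IN, W_OUT, width=4)   # parent model
```

Both calls read from the same weight tables; only `width` changes. In this toy setup, that is what lets a single checkpoint serve several model sizes.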
📝 Abstract
Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.
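The per-phase selection described in the abstract can be sketched as a small cost comparison. The sketch below assumes a crude compute proxy (active parameters times generated tokens), which is not the paper's latency measurement, and the token counts are hypothetical; only the three model sizes come from the abstract. It says nothing about accuracy, which the paper evaluates separately.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Submodel:
    name: str
    active_params_m: int  # active parameters per token, in millions


# Sizes quoted in the abstract: 30B (3.6A), 23B (2.8A), 12B (2.0A).
FAMILY = [
    Submodel("30B", 3600),
    Submodel("23B", 2800),
    Submodel("12B", 2000),
]


def compute_proxy(think, answer, think_tokens, answer_tokens):
    """Assumed cost proxy: active parameters x tokens, summed over both phases."""
    return (think.active_params_m * think_tokens
            + answer.active_params_m * answer_tokens)


def enumerate_schedules(think_tokens, answer_tokens):
    """Return (think_name, answer_name, proxy) for every pairing, cheapest first."""
    rows = [
        (t.name, a.name, compute_proxy(t, a, think_tokens, answer_tokens))
        for t, a in product(FAMILY, repeat=2)
    ]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical phase lengths: a long thinking trace and a short answer.
schedules = enumerate_schedules(think_tokens=1000, answer_tokens=100)
for think_name, answer_name, proxy in schedules:
    print(f"think={think_name:>3}  answer={answer_name:>3}  proxy={proxy}")
```

With three nested submodels there are nine thinking/answering pairings, so a deployment could pick the pairing that fits a given latency budget without loading a separate checkpoint per model size.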
Problem

Research questions and friction points this paper is trying to address.

large language models
training efficiency
static architecture
elastic budget control
model compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

elastic budget control
nested submodels
post-training
Mixture-of-Experts
Quantization-Aware Distillation
Ali Taghibakhshi
Deep Learning Algorithm Engineer, NVIDIA
Scientific Computing, Machine Learning, Graph Neural Networks, Reinforcement Learning
Ruisi Cai
University of Texas at Austin
computer vision, machine learning, image and video processing
Saurav Muralidharan
NVIDIA
Efficient Deep Learning, Large Language Models
Sharath Turuvekere Sreenivas
Aditya Vavre
University of Texas at Austin
Natural Language Processing
Ameya Sunil Mahabaleshwarkar
Deep Learning Scientist, NVIDIA
Deep Learning, Natural Language Processing, Large Language Models, Small Language Models
Bilal Kartal
NVIDIA
AI, Deep Learning, Reinforcement Learning, Multi-Agent Systems
Sheldon Liang
Marcin Chochowski
NVIDIA, previously Samsung R&D Poland
NLP, Deep learning, biometrics
Zijia Chen
Senior Deep Learning Scientist, NVIDIA Corporation
Natural Language Processing, Artificial Intelligence, Multimodal Model
Akhiad Bercovich
PhD candidate, Weizmann Institute of Science
Single Cell Genomics, Epigenomics, Machine Learning, DNA language/regulation models, efficient LLMs
Ran Zilberstein
Ran El-Yaniv
Professor of Computer Science, Technion - Israel Institute of Technology. Chief Scientist - Deci AI
Machine learning, deep learning, financial modeling
Yonatan Geifman
NVIDIA
Machine Learning, Deep Learning
Daniel Korzekwa
Nvidia
Pruning, Distillation, LLM, VLM, Speech
Yoshi Suhara
NVIDIA
Natural Language Processing, Machine Learning, Computational Social Science
Oluwatobi Olabiyi
Ashwath Aithal
Nima Tajbakhsh
Nvidia Inc.
Computer vision and Artificial Intelligence
Pavlo Molchanov
NVIDIA Research
AI, Machine Learning, Efficient Deep Learning, Semi-supervised learning, network inversion