Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference cost of reasoning-intensive large language models, which stems from generating lengthy reasoning traces, by improving deployment efficiency without compromising accuracy. It extends the Puzzle framework to Mixture-of-Experts (MoE) models for the first time and introduces a request-level, efficiency-centric evaluation methodology for balancing accuracy and speed. By combining heterogeneous expert pruning, windowed-attention substitution, FP8 quantization of the KV cache, and post-training reinforcement-learning fine-tuning, the approach achieves a 2.82× throughput speedup on a single H100 GPU, 1.63× and 1.22× speedups on an 8×H100 node for long and short contexts respectively, and up to a 1.29× improvement in request-level efficiency. Crucially, the optimized model matches or slightly exceeds the original model's accuracy across multiple benchmarks, at 100.8%–108.2% relative performance.
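The FP8 KV-cache component mentioned above is easy to picture concretely. The paper's actual quantization code is not reproduced here; the following is a minimal PyTorch sketch of per-tensor FP8 (e4m3) quantization with a calibrated scale, where all function names and shapes are illustrative assumptions:

```python
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def calibrate_scale(samples: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale from a calibration pass: map the observed
    # absolute maximum onto the representable FP8 range.
    return samples.abs().max().float() / FP8_MAX

def quantize_kv(kv: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale down, clamp to the FP8 range, and cast; the scale is stored
    # alongside the cache so attention kernels can dequantize on read.
    return (kv.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

def dequantize_kv(kv_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return kv_fp8.to(torch.float16) * scale.to(torch.float16)

# Calibrate once on representative keys/values, then reuse the scale.
calib = torch.randn(4096, 128)                  # [tokens, head_dim], illustrative
scale = calibrate_scale(calib)
kv_cache = quantize_kv(torch.randn(16, 128), scale)
restored = dequantize_kv(kv_cache, scale)
```

A static calibrated scale of this kind (as opposed to per-batch dynamic scaling) keeps dequantization cheap at decode time, which is presumably why the abstract stresses "calibrated scales."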

📝 Abstract
Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8×H100 node we achieve 1.63× and 1.22× throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers a 2.82× throughput speedup on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variant, improvements in per-token throughput (tok/s) and latency (ms/token) do not necessarily translate into end-to-end speedups: a 2× throughput gain is erased if traces grow 2×. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by the number of tokens generated and trace an accuracy–speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29× higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.
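The abstract's request-level argument reduces to simple arithmetic. The paper's exact metric definition is not reproduced here; this sketch assumes the simplest normalization (requests per second = token throughput divided by tokens per request), with illustrative numbers:

```python
def request_level_efficiency(tokens_per_second: float,
                             avg_tokens_per_request: float) -> float:
    """Requests served per second: per-token throughput normalized by
    the number of tokens each request actually generates."""
    return tokens_per_second / avg_tokens_per_request

# A 2x per-token throughput gain is erased if traces also grow 2x ...
baseline = request_level_efficiency(1_000.0, 2_000.0)   # 0.5 req/s
inflated = request_level_efficiency(2_000.0, 4_000.0)   # still 0.5 req/s
# ... but it is a genuine end-to-end win if trace length stays flat.
genuine  = request_level_efficiency(2_000.0, 2_000.0)   # 1.0 req/s
```

Sweeping this metric across reasoning-effort settings is what traces out the accuracy–speed frontier the abstract describes.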
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
inference optimization
reasoning efficiency
large language models
serving cost
Innovation

Methods, ideas, or system contributions that make the work stand out. An illustrative expert-pruning sketch follows this list.

Mixture-of-Experts pruning
post-training neural architecture search
FP8 KV-cache quantization
window attention
request-level efficiency
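As referenced above, here is a minimal sketch of frequency-based MoE expert pruning. Puzzle's actual procedure is heterogeneous and search-driven rather than a fixed heuristic, so everything below (the frequency scoring, the top-k of 4, the expert counts) is an illustrative assumption, not the paper's method:

```python
import torch

def select_experts_to_keep(router_logits: torch.Tensor,
                           keep: int, top_k: int = 4) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts], collected on calibration data.
    # Score each expert by how often it appears among the top-k routing
    # choices, then retain the `keep` most-used experts for this layer.
    chosen = router_logits.topk(top_k, dim=-1).indices
    usage = torch.bincount(chosen.flatten(), minlength=router_logits.size(-1))
    return usage.topk(keep).indices  # ids of the experts this layer retains

# "Heterogeneous" pruning: the NAS search can assign each layer its own
# expert budget instead of a uniform keep ratio across the network.
logits = torch.randn(10_000, 128)                 # 128 experts, illustrative
kept = select_experts_to_keep(logits, keep=96)    # this layer keeps 96
```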
👥 Authors

Akhiad Bercovich
PhD candidate, Weizmann Institute of Science
Single Cell Genomics, Epigenomics, Machine Learning, DNA language/regulation models, efficient LLMs

Nir Ailon
Technion - Israel Institute of Technology
Algorithms, ML

Vladimir Anisimov

Tomer Asida

Nave Assaf

Mohammad Dabbah
Group CTO, Como 1907 | SENT Entertainment
Machine Learning, Artificial Intelligence, Pattern Recognition, Computer Vision, Signal Processing

Ido Galil
Technion - Israel Institute of Technology
Machine learning, Deep learning

Amnon Geifman
PhD student, Weizmann Institute of Science
Computer Vision, Structure from Motion, Deep Learning, Machine Learning

Yonatan Geifman
NVIDIA
Machine Learning, Deep Learning

Izhak Golan
Deep Learning Researcher, NVIDIA
Deep Learning, Anomaly Detection, Generative Models, Large Language Models

Roi Koren

Itay Levy
Researcher, NVIDIA
Natural Language Processing

Zach Moshe
Unknown affiliation

Pavlo Molchanov
NVIDIA Research
AI, Machine Learning, Efficient Deep Learning, Semi-supervised learning, network inversion

Najeeb Nabwani
Unknown affiliation
Deep Learning

Mostofa Patwary

Omri Puny
Ph.D. student, Weizmann Institute of Science
Graph Neural Networks, Geometric Deep Learning, Deep Learning

Tomer Ronen

Itamar Schen

Elad Segal
NVIDIA
Natural Language Understanding, Machine Learning

Ido Shahaf

Oren Tropp

Ran Zilberstein

Ran El-Yaniv
Professor of Computer Science, Technion - Israel Institute of Technology; Chief Scientist, Deci AI
Machine learning, deep learning, financial modeling