Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference cost of reasoning-intensive large language models, which stems from generating lengthy reasoning traces, by improving deployment efficiency without compromising accuracy. It extends the Puzzle framework to Mixture-of-Experts (MoE) models for the first time and introduces a request-level, efficiency-centric evaluation methodology for balancing accuracy and speed. By combining heterogeneous expert pruning, windowed-attention substitution, FP8 quantization of the KV cache, and post-training reinforcement-learning fine-tuning, the approach achieves a 2.82× throughput speedup on a single H100 GPU, 1.63× and 1.22× speedups on an 8×H100 node for long and short contexts respectively, and up to a 1.29× improvement in request-level efficiency. Crucially, the optimized model matches or slightly exceeds the original model's accuracy across multiple benchmarks, at 100.8%–108.2% relative performance.
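The FP8 KV-cache component mentioned above is easy to picture concretely. The paper's actual quantization code is not reproduced here; the following is a minimal PyTorch sketch of per-tensor FP8 (e4m3) quantization with a calibrated scale, where all function names and shapes are illustrative assumptions:

```python
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def calibrate_scale(samples: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale from a calibration pass: map the observed
    # absolute maximum onto the representable FP8 range.
    return samples.abs().max().float() / FP8_MAX

def quantize_kv(kv: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale down, clamp to the FP8 range, and cast; the scale is stored
    # alongside the cache so attention kernels can dequantize on read.
    return (kv.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

def dequantize_kv(kv_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return kv_fp8.to(torch.float16) * scale.to(torch.float16)

# Calibrate once on representative keys/values, then reuse the scale.
calib = torch.randn(4096, 128)                  # [tokens, head_dim], illustrative
scale = calibrate_scale(calib)
kv_cache = quantize_kv(torch.randn(16, 128), scale)
restored = dequantize_kv(kv_cache, scale)
```

A static calibrated scale of this kind (as opposed to per-batch dynamic scaling) keeps dequantization cheap at decode time, which is presumably why the abstract stresses "calibrated scales."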

📝 Abstract
Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8×H100 node we achieve 1.63× and 1.22× throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers a 2.82× throughput speedup on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variant, improvements in per-token throughput (tok/s) and latency (ms/token) do not necessarily translate into end-to-end speedups: a 2× throughput gain is erased if traces grow 2×. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by the number of tokens generated and trace an accuracy–speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29× higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.
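The abstract's request-level argument reduces to simple arithmetic. The paper's exact metric definition is not reproduced here; this sketch assumes the simplest normalization (requests per second = token throughput divided by tokens per request), with illustrative numbers:

```python
def request_level_efficiency(tokens_per_second: float,
                             avg_tokens_per_request: float) -> float:
    """Requests served per second: per-token throughput normalized by
    the number of tokens each request actually generates."""
    return tokens_per_second / avg_tokens_per_request

# A 2x per-token throughput gain is erased if traces also grow 2x ...
baseline = request_level_efficiency(1_000.0, 2_000.0)   # 0.5 req/s
inflated = request_level_efficiency(2_000.0, 4_000.0)   # still 0.5 req/s
# ... but it is a genuine end-to-end win if trace length stays flat.
genuine  = request_level_efficiency(2_000.0, 2_000.0)   # 1.0 req/s
```

Sweeping this metric across reasoning-effort settings is what traces out the accuracy–speed frontier the abstract describes.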
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
inference optimization
reasoning efficiency
large language models
serving cost
Innovation

Methods, ideas, or system contributions that make the work stand out. An illustrative expert-pruning sketch follows this list.

Mixture-of-Experts pruning
post-training neural architecture search
FP8 KV-cache quantization
window attention
request-level efficiency
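As referenced above, here is a minimal sketch of frequency-based MoE expert pruning. Puzzle's actual procedure is heterogeneous and search-driven rather than a fixed heuristic, so everything below (the frequency scoring, the top-k of 4, the expert counts) is an illustrative assumption, not the paper's method:

```python
import torch

def select_experts_to_keep(router_logits: torch.Tensor,
                           keep: int, top_k: int = 4) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts], collected on calibration data.
    # Score each expert by how often it appears among the top-k routing
    # choices, then retain the `keep` most-used experts for this layer.
    chosen = router_logits.topk(top_k, dim=-1).indices
    usage = torch.bincount(chosen.flatten(), minlength=router_logits.size(-1))
    return usage.topk(keep).indices  # ids of the experts this layer retains

# "Heterogeneous" pruning: the NAS search can assign each layer its own
# expert budget instead of a uniform keep ratio across the network.
logits = torch.randn(10_000, 128)                 # 128 experts, illustrative
kept = select_experts_to_keep(logits, keep=96)    # this layer keeps 96
```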
👥 Authors

Akhiad Bercovich
PhD candidate, Weizmann Institute of Science
Single Cell Genomics, Epigenomics, Machine Learning, DNA language/regulation models, efficient LLMs

Nir Ailon
Technion - Israel Institute of Technology
Algorithms, ML

Vladimir Anisimov

Tomer Asida

Nave Assaf

Mohammad Dabbah
Group CTO, Como 1907 | SENT Entertainment
Machine Learning, Artificial Intelligence, Pattern Recognition, Computer Vision, Signal Processing

Ido Galil
Technion - Israel Institute of Technology
Machine learning, Deep learning

Amnon Geifman
PhD student, Weizmann Institute of Science
Computer Vision, Structure from Motion, Deep Learning, Machine Learning

Yonatan Geifman
NVIDIA
Machine Learning, Deep Learning

Izhak Golan
Deep Learning Researcher, NVIDIA
Deep Learning, Anomaly Detection, Generative Models, Large Language Models

Roi Koren

Itay Levy
Researcher, NVIDIA
Natural Language Processing

Zach Moshe
Unknown affiliation

Pavlo Molchanov
NVIDIA Research
AI, Machine Learning, Efficient Deep Learning, Semi-supervised learning, network inversion

Najeeb Nabwani
Unknown affiliation
Deep Learning

Mostofa Patwary

Omri Puny
Ph.D. student, Weizmann Institute of Science
Graph Neural Networks, Geometric Deep Learning, Deep Learning

Tomer Ronen

Itamar Schen

Elad Segal
NVIDIA
Natural Language Understanding, Machine Learning

Ido Shahaf

Oren Tropp

Ran Zilberstein

Ran El-Yaniv
Professor of Computer Science, Technion - Israel Institute of Technology; Chief Scientist, Deci AI
Machine learning, deep learning, financial modeling