Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the high inference cost and deployment challenges of large language models (LLMs), this paper proposes Puzzle, a hardware-aware neural architecture search (NAS) framework. Methodologically, Puzzle introduces two key innovations: (1) blockwise local knowledge distillation (BLD), which enables efficient parallel architecture exploration; and (2) mixed-integer programming, which jointly optimizes model accuracy and hardware constraints, including GPU memory footprint and latency. Evaluated at the billion-parameter scale, Puzzle automatically discovers Nemotron-51B, a lightweight yet high-performance architecture derived from Llama-3.1-70B-Instruct. On a single H100 GPU, Nemotron-51B achieves 2.17× higher inference throughput than its teacher model while retaining 98.4% of its benchmark accuracy. Moreover, although trained on only 45B tokens (versus the 15T used to train Llama-70B), it is the most accurate model supporting single-H100 inference at large batch sizes, effectively bridging the gap between state-of-the-art performance and practical deployment feasibility.
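The blockwise local distillation idea can be illustrated at toy scale: because each candidate block is fit only to mimic the parent block's input-output behavior on cached activations, every candidate can be trained independently and in parallel. The sketch below is a stand-in, not the paper's implementation: a random linear map plays the role of a parent transformer block, and a low-rank linear map plays the role of a cheaper child block fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_samples, rank = 64, 64, 512, 8

# Hypothetical "parent block": a fixed random linear map standing in
# for a full transformer block's input->output behavior.
W_parent = rng.normal(size=(d_in, d_out))
X = rng.normal(size=(n_samples, d_in))      # cached parent-model activations
Y = X @ W_parent                            # parent block outputs

# Candidate "child block": a cheaper low-rank replacement, fit locally to
# mimic the parent block's outputs. This fit needs no other blocks, which
# is what makes the candidate library trainable in parallel.
W_full = np.linalg.lstsq(X, Y, rcond=None)[0]
U, S, Vt = np.linalg.svd(W_full)
W_child = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

# Local distillation error of this candidate (used later when choosing
# which candidate to place at each layer).
mse = float(np.mean((X @ W_child - Y) ** 2))
```

In the real framework the candidates are alternative attention/FFN block variants trained with gradient descent against the parent's activations; the least-squares fit here only mirrors the "match the parent block locally" objective.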

📝 Abstract
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model derived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracies. Notably, it is the most accurate model supporting single H100 GPU inference with large batch sizes, despite training on only 45B tokens, far fewer than the 15T used to train Llama-70B. Lastly, we derive Llama-3.3-Nemotron-49B-Super-Base to demonstrate that Puzzle can retain long-context capability and that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
Problem

Research questions and friction points this paper is trying to address.

Reduces high inference costs of large language models
Optimizes LLMs for efficient deployment on hardware
Maintains model accuracy while improving inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-aware framework accelerates LLM inference.
Blockwise local distillation for parallel architecture exploration.
Mixed-integer programming optimizes precise constraints.
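The selection step described above can be sketched as a tiny optimization problem: pick one block variant per layer to maximize an accuracy proxy subject to latency and memory budgets. Puzzle solves this with a mixed-integer program; the brute-force enumeration below (with made-up per-layer numbers) only illustrates the same objective and constraints at toy scale.

```python
from itertools import product

# Hypothetical per-layer candidates: (accuracy_proxy, latency_ms, memory_gb).
# Index 0 is the original (parent) block; higher indices are cheaper variants.
candidates = [
    [(0.98, 3.0, 1.2), (0.95, 2.0, 0.8), (0.90, 1.2, 0.5)],  # layer 0
    [(0.99, 3.0, 1.2), (0.96, 1.8, 0.7), (0.88, 1.0, 0.4)],  # layer 1
    [(0.97, 3.0, 1.2), (0.94, 2.1, 0.9), (0.91, 1.5, 0.6)],  # layer 2
]
latency_budget, memory_budget = 6.0, 2.5  # hardware constraints (illustrative)

best_score, best_choice = -1.0, None
# Enumerate every assignment of one variant per layer (the integer variables
# a MIP solver would handle implicitly at real scale).
for choice in product(*(range(len(layer)) for layer in candidates)):
    picked = [candidates[i][j] for i, j in enumerate(choice)]
    latency = sum(p[1] for p in picked)
    memory = sum(p[2] for p in picked)
    if latency <= latency_budget and memory <= memory_budget:
        score = sum(p[0] for p in picked)  # accuracy proxy to maximize
        if score > best_score:
            best_score, best_choice = score, choice
```

With these numbers the budgets rule out keeping every parent block, and the solver settles on the mid-size variant at each layer; at Puzzle's scale the search space is far too large to enumerate, which is why a MIP formulation is used instead.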