OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction

πŸ“… 2025-12-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of balancing accuracy and efficiency in post-training pruning of large language models (LLMs), this paper proposes OPTIMAβ€”a highly efficient and scalable one-shot pruning method. Methodologically, OPTIMA formulates weight reconstruction as a set of parallel, layer-wise row-level quadratic programming (QP) problems, each constrained by a shared Hessian approximation, enabling globally optimal row-wise updates without fine-tuning. It employs a computationally efficient Hessian estimator and an accelerator-friendly batched QP solver, seamlessly integrating with existing mask selectors while ensuring both theoretical optimality and hardware efficiency. Empirically, OPTIMA achieves up to a 3.97% improvement in zero-shot accuracy across multiple LLM families. On a single NVIDIA H100 GPU, it completes end-to-end pruning of an 8B-parameter Transformer in under 40 hours, with peak memory consumption capped at 60 GB.
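Concretely, the row-level QP described above can be written in standard layer-wise reconstruction notation (the symbols below are conventional for this setting, not taken verbatim from the paper). For a layer with calibration inputs X, an original weight row w, and a pruning mask with kept support S (complement S̄), the reconstruction problem and its closed-form solution are:

```latex
\min_{\hat{w}\,:\,\hat{w}_{\bar{S}} = 0}\ \left\| \hat{w}^{\top} X - w^{\top} X \right\|_2^2
\;=\; (\hat{w} - w)^{\top} H (\hat{w} - w), \qquad H = X X^{\top},
```

```latex
\text{stationarity on } S:\quad H_{SS}\,\hat{w}_S = (H w)_S
\;\;\Longrightarrow\;\;
\hat{w}_S = H_{SS}^{-1} (H w)_S .
```

Because every row of the layer shares the same H, only the index set S varies per row, which is what makes the many small solves batchable on an accelerator.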

πŸ“ Abstract
Post-training pruning is a promising route to making large language models (LLMs) cheaper to deploy, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT strike a practical balance by applying efficient but approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to a 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end-to-end in 40 hours with 60 GB peak memory. Together, these results set a new state of the art in the accuracy-efficiency trade-off for one-shot post-training pruning.
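The abstract's shared-Hessian, row-wise reconstruction can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the paper's: the function name, the ridge damping, and the dense per-row `solve` loop are assumptions (the paper instead batches many small QP solves on the accelerator).

```python
import numpy as np

def reconstruct_rows(W, mask, H, ridge=1e-4):
    """Per-row optimal weight reconstruction under a shared layer Hessian.

    W    : (rows, d) original weight matrix
    mask : (rows, d) boolean, True = weight is kept
    H    : (d, d)   shared Hessian estimate, e.g. X @ X.T from calibration data
    Each row's kept entries minimize (w_hat - w)^T H (w_hat - w)
    subject to pruned entries being zero.
    """
    d = H.shape[0]
    # Small trace-scaled damping for numerical stability (illustrative choice).
    H = H + ridge * np.trace(H) / d * np.eye(d)
    HW = W @ H  # row i holds (H w_i)^T since H is symmetric
    W_hat = np.zeros_like(W)
    for i in range(W.shape[0]):  # in practice: batched, not a Python loop
        S = np.flatnonzero(mask[i])
        # Stationarity condition of the row QP: H_SS w_S = (H w)_S
        W_hat[i, S] = np.linalg.solve(H[np.ix_(S, S)], HW[i, S])
    return W_hat
```

By construction, each reconstructed row achieves a reconstruction error no larger than simply zeroing the pruned weights, which is the accuracy gap the method closes relative to magnitude-style baselines.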
Problem

Research questions and friction points this paper is trying to address.

Balancing accuracy and scalability in one-shot post-training pruning
Recovering accuracy after mask selection without computationally infeasible joint optimization
Pruning large models on a single accelerator without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses quadratic programming for optimal weight reconstruction
Solves many small QPs in parallel on accelerators
Integrates with existing mask selectors without fine-tuning
πŸ”Ž Similar Papers
No similar papers found.