🤖 AI Summary
Greedy heuristic-based pruning methods for large language models (LLMs) neglect inter-weight dependencies, leading to accumulated errors and suboptimal sparsity-accuracy trade-offs.
Method: This work formulates layer-wise pruning as a combinatorial optimization problem and relaxes the combinatorial constraints into a convex set, turning mask selection into a differentiable optimization task. The relaxed problem is solved with the Frank-Wolfe (FW) algorithm, whose convergence guarantees yield an approximation guarantee for the original combinatorial problem once the relaxed solution is rounded. No full retraining is needed; only a small calibration dataset is required for layer-wise pruning that accounts for weight interactions.
Contribution/Results: Evaluated on state-of-the-art GPT architectures, the method drastically reduces per-layer pruning error and mitigates accuracy degradation compared to strong baselines, achieving an average +2.1% improvement in task accuracy under identical sparsity constraints, while remaining memory-efficient and improving inference efficiency, thereby addressing a fundamental limitation of layer-wise greedy pruning.
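The combinatorial problem and its relaxation described above can be written schematically as follows (the notation is an assumption for illustration, not taken from the paper): for a layer with weight matrix W, calibration inputs X, and pruning mask M,

```latex
\min_{M}\ \big\| (M \odot W - W)\,X \big\|_F^2
\quad \text{s.t.}\quad M \in \{0,1\}^{d_{\mathrm{out}} \times d_{\mathrm{in}}},\ \
\textstyle\sum_{i,j} M_{ij} \le k,
```

and the convex relaxation replaces the binary constraint with $M \in [0,1]^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, yielding a differentiable objective over a polytope, the setting in which the Frank-Wolfe algorithm applies.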
📝 Abstract
Pruning is a common technique to reduce the compute and storage requirements of neural networks. While conventional approaches typically retrain the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small calibration dataset to avoid full retraining, which is considered computationally prohibitive for LLMs. However, finding the optimal pruning mask is a hard combinatorial problem and solving it to optimality is intractable. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. In this work, we instead consider the convex relaxation of these combinatorial constraints and solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient. We provide theoretical justification by showing that, combined with the convergence guarantees of the FW algorithm, we obtain an approximate solution to the original combinatorial problem upon rounding the relaxed solution to integrality.
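The abstract's recipe (relax the binary mask to a polytope, run Frank-Wolfe, round back to a binary mask) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the per-row sparsity budget, step-size schedule, and rounding rule are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_prune(W, X, k, iters=100):
    """Sketch: approximately solve the relaxed layer-wise pruning problem
        min_M ||(M * W - W) X||_F^2
        s.t.  M in [0,1]^{out x in}, at most k nonzeros per row,
    with the Frank-Wolfe algorithm, then round M to a binary mask.
    (Per-row budget and rounding rule are illustrative assumptions.)"""
    H = X @ X.T                    # (in, in) second moment of calibration inputs
    M = np.zeros_like(W)           # start at a vertex of the feasible polytope
    rows = np.arange(W.shape[0])[:, None]
    for t in range(iters):
        # Gradient of f(M) = ||(M*W - W) X||_F^2 with respect to M.
        G = 2.0 * (((M * W - W) @ H) * W)
        # Linear minimization oracle over the polytope: per row, set to 1
        # the (at most) k entries with the most negative gradient.
        S = np.zeros_like(M)
        idx = np.argsort(G, axis=1)[:, :k]        # k smallest gradients per row
        S[rows, idx] = (G[rows, idx] < 0).astype(W.dtype)
        gamma = 2.0 / (t + 2.0)                   # standard FW step size
        M = (1 - gamma) * M + gamma * S
    # Round the fractional solution: keep the k largest mask entries per row.
    mask = np.zeros_like(M)
    keep = np.argsort(-M, axis=1)[:, :k]
    mask[rows, keep] = 1.0
    return mask
```

A caller would then prune the layer as `W_pruned = mask * W`; the rounding step is what turns the relaxed solution into the approximate solution to the combinatorial problem mentioned in the abstract.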