🤖 AI Summary
Conventional pruning methods for large pre-trained Transformer models require retraining for each target compression ratio, hindering efficient deployment across diverse sparsity levels. Method: This paper proposes a differential-inclusion-based adaptive sparsification framework—the first to formulate structured mask parameter evolution as a continuous dynamical system—enabling single-shot optimization to generate a weight family spanning the full sparsity spectrum. It further introduces a joint pruning paradigm integrating modular pairing (coordinated decomposition of Q/K/V projections and linear layers) with low-rank approximation, preserving architectural integrity while compressing internal representations. Results: Evaluated on multiple mainstream Transformer backbones, the method achieves arbitrary compression ratios from a single pruning run, drastically reducing multi-ratio deployment overhead while keeping accuracy degradation controlled.
📝 Abstract
Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed compression ratios, requiring a separate pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamics generates the whole regularization solution path of the mask parameter, whose support set identifies the network structure. The solution path therefore identifies a transformer weight family with various sparsity levels, offering greater flexibility and customization. In this paper, we introduce such an effective pruning method, termed SPP (Solution Path Pruning). To achieve effective pruning, we segment the transformers into paired modules, including query-key pairs, value-projection pairs, and sequential linear layers, and apply low-rank compression to these pairs, maintaining the output structure while enabling structural compression within the inner states. Extensive experiments conducted on various well-known transformer backbones have demonstrated the efficacy of SPP.
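The paired-module idea above can be illustrated with a minimal sketch: two sequential linear maps (one of the pair types mentioned, such as a query-key pair or back-to-back linear layers) are composed and then factorized by truncated SVD, shrinking the shared inner dimension while leaving the input and output shapes untouched. All names, dimensions, and the SVD-based factorization here are illustrative assumptions, not the paper's actual algorithm (which selects structures via the mask solution path).

```python
import numpy as np

# Hypothetical sketch: a paired module W2 @ W1 with inner dimension d_inner
# is replaced by a rank-r factorization, compressing the inner state from
# d_inner to r while keeping the outer (input/output) shapes unchanged.
rng = np.random.default_rng(0)
d_in, d_inner, d_out, r = 64, 64, 64, 16

W1 = rng.standard_normal((d_inner, d_in))   # first layer of the pair
W2 = rng.standard_normal((d_out, d_inner))  # second layer of the pair

# Truncated SVD of the composed map; keep the top-r singular directions.
U, S, Vt = np.linalg.svd(W2 @ W1, full_matrices=False)
W1_c = np.diag(S[:r]) @ Vt[:r]  # compressed first layer: (r, d_in)
W2_c = U[:, :r]                 # compressed second layer: (d_out, r)

# The compressed pair approximates the original composed output.
x = rng.standard_normal(d_in)
full = W2 @ W1 @ x
low_rank = W2_c @ W1_c @ x
rel_err = np.linalg.norm(full - low_rank) / np.linalg.norm(full)
```

Because only the product of the pair is approximated, the surrounding network sees the same tensor shapes before and after compression, which is what allows structural compression of the inner states without altering the module's external interface.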