Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper targets the bottlenecks that keep joint structured pruning and quantization of neural networks from wide adoption: engineering complexity, black-box hyperparameter tuning, and poor cross-architecture generalizability. It proposes GETA, an automated co-compression framework with three components: (1) a quantization-aware dependency graph (QADG) that models inter-layer coupling and constructs a pruning search space for generic quantization-aware DNNs; (2) a partially projected stochastic gradient method that enforces per-layer bit-width constraints during joint optimization; and (3) a joint learning strategy that encodes interpretable relationships between pruning and quantization. GETA achieves state-of-the-art or competitive accuracy–compression trade-offs on both CNNs and Transformers while improving automation and cross-architecture generalizability.
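
To make the dependency-graph idea concrete, here is a minimal sketch of a pruning dependency group: coupled layers whose channels must be pruned together so the network stays shape-consistent. The `DependencyGroup` class and the two-convolution toy are illustrative assumptions, not the paper's implementation; GETA's QADG additionally folds quantizer nodes into this graph, which is omitted here.

```python
# Illustrative sketch only: a dependency group for structured channel pruning.
# GETA's QADG also tracks quantizer nodes; this toy handles shape coupling only.
import torch
import torch.nn as nn

class DependencyGroup:
    """Layers whose output/input channels must be pruned together."""
    def __init__(self, producers, consumers):
        self.producers = producers  # layers whose output channels are removed
        self.consumers = consumers  # layers whose input channels must match

    def prune(self, keep_idx):
        keep = torch.as_tensor(list(keep_idx))
        for m in self.producers:
            m.weight.data = m.weight.data[keep]      # cut output-channel dim
            if m.bias is not None:
                m.bias.data = m.bias.data[keep]
            m.out_channels = len(keep)
        for m in self.consumers:
            m.weight.data = m.weight.data[:, keep]   # cut input-channel dim to match
            m.in_channels = len(keep)

conv1 = nn.Conv2d(16, 32, 3, padding=1)
conv2 = nn.Conv2d(32, 16, 3, padding=1)
group = DependencyGroup(producers=[conv1], consumers=[conv2])
group.prune(keep_idx=range(0, 32, 2))                # keep every other channel
x = torch.randn(1, 16, 8, 8)
print(conv2(conv1(x)).shape)                         # torch.Size([1, 16, 8, 8])
```

In a real graph, batch-norm layers, residual additions, and concatenations would add further producers and consumers to each group; the QADG automates that discovery for arbitrary quantization-aware architectures.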

📝 Abstract
Structured pruning and quantization are fundamental techniques for reducing the size of deep neural networks (DNNs) and are typically applied independently. Applying them jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present GETA, a framework that automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNNs, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. Numerical experiments on both convolutional neural networks and transformer architectures show that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
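
One way to read the "partially projected" step in innovation (ii) is ordinary SGD on the unconstrained network weights plus a projection of learnable per-layer bit-widths back onto their feasible interval after every update. The sketch below is a hedged interpretation under that assumption; the function name and the box [b_min, b_max] are illustrative, and the paper's exact projection may differ.

```python
# Hedged sketch, not the paper's optimizer: plain SGD on weights, SGD plus a
# box projection (clamp) on the per-layer bit-width variables.
import torch

def partially_projected_sgd_step(weights, bits, lr, b_min=2.0, b_max=8.0):
    with torch.no_grad():
        for w in weights:             # unconstrained parameters: plain SGD
            w -= lr * w.grad
        for b in bits:                # constrained parameters: SGD step, then projection
            b -= lr * b.grad
            b.clamp_(b_min, b_max)    # project onto the feasible bit-width box

# Toy usage: one weight tensor and one learnable bit-width.
w = torch.randn(4, 4, requires_grad=True)
b = torch.tensor([8.0], requires_grad=True)
loss = (w ** 2).sum() + 0.1 * b.sum()   # stand-in for task loss plus a bit-width penalty
loss.backward()
partially_projected_sgd_step([w], [b], lr=0.5)
print(b.item())                          # remains within [2.0, 8.0]
```

Because the projection is a simple clamp onto a box, it adds negligible per-step cost while guaranteeing the layerwise bit constraints hold at every iterate.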
Problem

Research questions and friction points this paper is trying to address.

joint structured pruning quantization
efficient neural network compression
automatic co-optimization framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated joint pruning and quantization
Quantization-aware dependency graph
Stochastic gradient with bit constraints
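
The innovations above meet in a single training loop. Below is a hedged sketch of one plausible co-learning objective: quantization-aware training through a straight-through estimator combined with a group-lasso penalty that pushes whole output channels toward zero so they can later be pruned. The `fake_quant` helper, the STE gradient, and the penalty weight are standard stand-ins chosen for illustration, not GETA's actual interpretable formulation.

```python
# Illustrative co-learning sketch: fake-quantized forward pass (STE) plus a
# group-sparsity regularizer over output channels. Not the paper's exact loss.
import torch
import torch.nn as nn

def fake_quant(w, bits):
    """Uniform symmetric fake-quantization with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    wq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (wq - w).detach()          # forward: wq, backward: identity (STE)

def group_lasso(w):
    """L2 norm per output channel; summing it encourages prunable channels."""
    return w.flatten(1).norm(dim=1).sum()

conv = nn.Conv2d(8, 16, 3, padding=1)
x = torch.randn(2, 8, 8, 8)
out = nn.functional.conv2d(x, fake_quant(conv.weight, bits=4), conv.bias, padding=1)
loss = out.pow(2).mean() + 1e-3 * group_lasso(conv.weight)   # task loss + sparsity
loss.backward()
```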