🤖 AI Summary
To close the accuracy gap between low-rank pruning and semi-structured pruning for large language models (LLMs), this paper proposes Pivoting Factorization (PIFA), a lossless meta low-rank representation. PIFA identifies pivot rows (linearly independent rows) in a low-rank weight representation and expresses the remaining rows as linear combinations of them, eliminating redundant information without supervision. The authors further design a retraining-free low-rank reconstruction method, M, that minimizes error accumulation, and combine it with PIFA into an end-to-end framework, MPIFA. At a rank-to-dimension ratio of r/d = 0.5, PIFA saves an additional 24.2% of memory and accelerates inference by 24.6% over standard low-rank layers, and MPIFA significantly lowers perplexity compared with existing low-rank pruning methods. Notably, MPIFA is the first low-rank pruning approach to match semi-structured pruning in accuracy at comparable density, while surpassing it in GPU efficiency and compatibility.
📝 Abstract
The rapid growth of Large Language Models (LLMs) has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations of them, achieving an additional 24.2% memory savings and 24.6% faster inference over low-rank layers at r/d = 0.5, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
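The core idea behind PIFA can be illustrated with a minimal numpy sketch. This is not the paper's implementation (all variable names are ours): given a rank-r weight matrix, it greedily selects r linearly independent pivot rows, fits coefficients that express the remaining rows as combinations of the pivots, and checks that the representation is lossless. Storing pivot rows plus coefficients takes r·d + (d−r)·r parameters versus 2·d·r for plain low-rank factors, which at r/d = 0.5 is a 25% reduction, consistent with the ~24.2% memory savings reported.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4  # toy dimensions; the paper evaluates r/d = 0.5

# A rank-r weight matrix, standing in for a low-rank-pruned layer W = U @ V.
U = rng.standard_normal((d, r))
V = rng.standard_normal((r, d))
W = U @ V

# Greedily select r pivot rows (linearly independent rows) of W.
pivots = []
for i in range(d):
    if np.linalg.matrix_rank(W[pivots + [i]]) > len(pivots):
        pivots.append(i)
    if len(pivots) == r:
        break
nonpivots = [i for i in range(d) if i not in pivots]

W_p = W[pivots]  # r x d: pivot rows, stored verbatim

# Coefficients expressing each non-pivot row as a combination of pivot
# rows: solve C @ W_p ≈ W[nonpivots] in the least-squares sense.
C, *_ = np.linalg.lstsq(W_p.T, W[nonpivots].T, rcond=None)
C = C.T  # (d - r) x r

# Reconstruct and verify losslessness (up to floating-point error).
W_hat = np.empty_like(W)
W_hat[pivots] = W_p
W_hat[nonpivots] = C @ W_p
print(np.allclose(W, W_hat))  # True

# Parameter counts: PIFA (pivot rows + coefficients) vs. low-rank U, V.
pifa_params = r * d + (d - r) * r  # 48
lowrank_params = 2 * d * r         # 64
print(pifa_params, lowrank_params)
```

Because W is exactly rank r and the pivot rows span its row space, the reconstruction is exact; in practice the method is applied on top of a low-rank approximation of the original dense weights, so the only error is the one already incurred by the low-rank pruning step.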