🤖 AI Summary
To close the accuracy gap between low-rank pruning and semi-structured pruning for large language models (LLMs), this paper proposes Pivoting Factorization (PIFA), a lossless meta low-rank representation. PIFA identifies pivot rows (linearly independent rows) in a low-rank weight representation and expresses the remaining rows as linear combinations of them, eliminating redundant information without supervision. The authors further design a retraining-free low-rank reconstruction method, M, that minimizes error accumulation, and combine it with PIFA into an end-to-end framework, MPIFA. At a rank-to-dimension ratio of r/d = 0.5, PIFA saves an additional 24.2% of memory and accelerates inference by 24.6% over standard low-rank layers, and MPIFA significantly lowers perplexity compared with existing low-rank pruning methods. Notably, MPIFA is the first low-rank pruning approach to match semi-structured pruning in accuracy at comparable density, while surpassing it in GPU efficiency and compatibility.
📝 Abstract
The rapid growth of Large Language Models (LLMs) has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations of them, achieving an additional 24.2% memory savings and 24.6% faster inference over low-rank layers at r/d = 0.5, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
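The core idea behind PIFA can be illustrated with a minimal numpy sketch. This is not the paper's implementation (all variable names are ours): given a rank-r weight matrix, it greedily selects r linearly independent pivot rows, fits coefficients that express the remaining rows as combinations of the pivots, and checks that the representation is lossless. Storing pivot rows plus coefficients takes r·d + (d−r)·r parameters versus 2·d·r for plain low-rank factors, which at r/d = 0.5 is a 25% reduction, consistent with the ~24.2% memory savings reported.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4  # toy dimensions; the paper evaluates r/d = 0.5

# A rank-r weight matrix, standing in for a low-rank-pruned layer W = U @ V.
U = rng.standard_normal((d, r))
V = rng.standard_normal((r, d))
W = U @ V

# Greedily select r pivot rows (linearly independent rows) of W.
pivots = []
for i in range(d):
    if np.linalg.matrix_rank(W[pivots + [i]]) > len(pivots):
        pivots.append(i)
    if len(pivots) == r:
        break
nonpivots = [i for i in range(d) if i not in pivots]

W_p = W[pivots]  # r x d: pivot rows, stored verbatim

# Coefficients expressing each non-pivot row as a combination of pivot
# rows: solve C @ W_p ≈ W[nonpivots] in the least-squares sense.
C, *_ = np.linalg.lstsq(W_p.T, W[nonpivots].T, rcond=None)
C = C.T  # (d - r) x r

# Reconstruct and verify losslessness (up to floating-point error).
W_hat = np.empty_like(W)
W_hat[pivots] = W_p
W_hat[nonpivots] = C @ W_p
print(np.allclose(W, W_hat))  # True

# Parameter counts: PIFA (pivot rows + coefficients) vs. low-rank U, V.
pifa_params = r * d + (d - r) * r  # 48
lowrank_params = 2 * d * r         # 64
print(pifa_params, lowrank_params)
```

Because W is exactly rank r and the pivot rows span its row space, the reconstruction is exact; in practice the method is applied on top of a low-rank approximation of the original dense weights, so the only error is the one already incurred by the low-rank pruning step.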