Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To close the accuracy gap between low-rank pruning and semi-structured pruning for large language models (LLMs), this paper proposes Pivoting Factorization (PIFA). PIFA identifies pivot rows in weight matrices without supervision and models the linear dependencies of the remaining rows on them, constructing a meta low-rank representation that losslessly compresses any low-rank factorization. The authors further design an error-minimizing, retraining-free reconstruction strategy (M) and combine it with PIFA into an end-to-end framework, MPIFA. At a rank-to-dimension ratio r/d = 0.5, PIFA reduces memory consumption by 24.2% and accelerates inference by 24.6% compared with standard low-rank layers, while MPIFA significantly lowers perplexity relative to prior low-rank pruning methods. Notably, MPIFA achieves, for the first time at comparable density, accuracy on par with semi-structured pruning, and its low-rank tensor operations run natively on GPUs, ensuring high efficiency, hardware compatibility, and deployment friendliness.
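The 24.2% memory figure at r/d = 0.5 can be sanity-checked with back-of-the-envelope arithmetic. The storage scheme below is our assumption (r pivot rows kept verbatim plus a (d−r)×r coefficient matrix, versus two d×r and r×d factors for a plain low-rank layer); the paper's exact accounting, including any indexing overhead, may differ slightly, which would explain 24.2% rather than an even 25%.

```python
# Back-of-the-envelope parameter count for a d x d weight at rank r.
# Storage scheme is an assumption, not taken from the paper; indexing
# overhead is ignored.
d, r = 4096, 2048            # r/d = 0.5
low_rank = 2 * d * r         # plain low-rank: factors A (d x r) and B (r x d)
pifa = r * d + (d - r) * r   # r pivot rows + (d - r) x r mixing coefficients
print(1 - pifa / low_rank)   # -> 0.25, close to the reported 24.2%
```

Algebraically the saving is r/(2d), so it grows linearly with the retained rank ratio and vanishes as r/d → 0.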

📝 Abstract
The rapid growth of Large Language Models (LLMs) has driven demand for effective model compression techniques that reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations of them, achieving an additional 24.2% memory saving and 24.6% faster inference than standard low-rank layers at r/d = 0.5, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel retraining-free low-rank reconstruction method (M) that minimizes error accumulation. MPIFA, which combines M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
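The pivot-row idea in the abstract can be illustrated with a small sketch. Everything here is a hypothetical stand-in, not the authors' implementation: pivot rows are chosen by a greedy deflation heuristic, and the mixing coefficients are fit by least squares. For an exactly rank-r matrix, storing only the pivot rows plus coefficients reconstructs the matrix losslessly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # exact rank r

def pivoting_factorization(W, r):
    """Illustrative sketch: pick r (near-)linearly-independent 'pivot' rows,
    then fit every non-pivot row as a linear combination of them."""
    d = W.shape[0]
    R = W.astype(float).copy()
    pivots = []
    for _ in range(r):
        p = int(np.argmax(np.linalg.norm(R, axis=1)))  # most independent row left
        pivots.append(p)
        v = R[p] / np.linalg.norm(R[p])
        R -= np.outer(R @ v, v)  # deflate: remove the span of the chosen row
    nonpiv = [i for i in range(d) if i not in pivots]
    # Least-squares coefficients C such that W[nonpiv] ~= C @ W[pivots]
    C = np.linalg.lstsq(W[pivots].T, W[nonpiv].T, rcond=None)[0].T
    return pivots, nonpiv, C

pivots, nonpiv, C = pivoting_factorization(W, r)
W_hat = np.empty_like(W)
W_hat[pivots] = W[pivots]       # pivot rows are stored verbatim
W_hat[nonpiv] = C @ W[pivots]   # non-pivot rows are reconstructed on the fly
assert np.allclose(W, W_hat)    # lossless for an exactly rank-r matrix
```

Only `W[pivots]` (r×d) and `C` ((d−r)×r) need to be kept, which is where the compression over a plain two-factor low-rank layer comes from.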
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Model Compression
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

PIFA
Memory Efficiency
Model Pruning
👥 Authors

Jialin Zhao
Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI), Department of Computer Science, Tsinghua University, Beijing, China

Yingtao Zhang
Professor of Computer Science, Harbin Institute of Technology (Pattern Recognition · Machine Learning · Computer Vision · Image Processing)

C. Cannistraci
Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI), Department of Computer Science, Department of Biomedical Engineering, Tsinghua University, Beijing, China