🤖 AI Summary
To address memory and computational bottlenecks arising from the explosive growth of large language model (LLM) parameters, this work proposes a **data-agnostic joint rank-k approximation compression framework**, the first to simultaneously achieve weight compression, denoising, and orthogonality preservation—without any calibration data. The method integrates low-rank matrix decomposition, structured pruning, INT4/INT8 quantization, and orthogonality-constrained optimization, with compression strategies adaptively determined via spectral analysis of the weight matrices. Experiments demonstrate that, without any task-specific data or fine-tuning, the framework prunes 80% of the model's parameters while retaining 93.43% of the original performance. It significantly accelerates inference and reduces GPU memory footprint. Crucially, it eliminates the dependency on downstream data inherent in conventional compression methods, establishing a novel paradigm for efficient, deployable, lightweight LLMs.
📝 Abstract
Large Language Models (LLMs) are reshaping the research landscape in artificial intelligence, particularly as model parameters scale up significantly, unlocking remarkable capabilities across various domains. Nevertheless, the scalability of model parameters faces constraints due to limitations in GPU memory and computational speed. To address these constraints, various weight compression methods have emerged, such as pruning and quantization. Given the low-rank nature of weight matrices in language models, reducing weights through matrix decomposition holds significant potential and promise. In this paper, drawing upon the intrinsic structure of LLMs, we propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. Notably, our method requires no additional corpus whatsoever, while preserving orthogonality in conjunction with pruning and quantization methods. Without any calibration data, we prune 80% of the model's parameters while retaining 93.43% of the original performance. Additionally, we explore the fundamental properties of LLM weight matrices that have undergone Rank-k Approximation, and we conduct comprehensive experiments to elucidate our hypothesis.
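The core idea underlying the compression described above can be illustrated with a minimal sketch: a truncated SVD gives the best rank-k approximation of a weight matrix (in Frobenius norm), and storing the three factors costs far less than storing the dense matrix when k is small. This is only a generic illustration of rank-k approximation, not the paper's joint method; the matrix sizes, rank, and noise level below are hypothetical.

```python
import numpy as np

def rank_k_approximation(W, k):
    """Best rank-k approximation of W in Frobenius norm, via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# Hypothetical low-rank-plus-noise matrix, mimicking the approximately
# low-rank structure of LLM weight matrices assumed in the paper.
W = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512)) \
    + 0.01 * rng.standard_normal((512, 512))

W_k = rank_k_approximation(W, k=64)

# Storing the truncated factors costs k*(m + n + 1) numbers
# instead of m*n for the dense matrix.
m, n, k = 512, 512, 64
compression = k * (m + n + 1) / (m * n)          # fraction of original storage
rel_error = np.linalg.norm(W - W_k) / np.linalg.norm(W)
```

Because the matrix is dominated by a rank-64 component, the approximation error is tiny while the factored form needs only about a quarter of the original storage; for genuinely low-rank LLM weights, the same trade-off motivates decomposition-based compression.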