EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

📅 2024-10-28

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address performance degradation in compressed large language models (LLMs), this paper proposes EoRA, a training-free, customizable compensation method that adds residual low-rank paths to a compressed model. Rather than applying SVD directly to the compression error, EoRA projects the error into the eigenspace of the input activations and uses the eigenvalues to prioritize high-importance error components. It requires no gradient-based training, needs only a small amount of calibration data, and completes optimization within minutes. Because it is not tied to a specific compression format, it can compensate for both quantization and pruning and can adapt to different tasks and compression ratios. On LLaMA3-8B quantized to 4-bit and pruned to 2:4 sparsity, EoRA yields improvements of 31.31% on ARC-Easy, 12.88% on ARC-Challenge, and 9.69% on MathQA.

📝 Abstract

In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs in various capacity and efficiency requirements.
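
The eigenspace projection described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the square-root eigenvalue scaling, the `eps` clamp, and the toy rounding "quantizer" are all assumptions chosen for illustration. It compares a plain truncated SVD of the compression error with a truncated SVD taken after projecting the error into the eigenspace of the calibration activations, and measures the error that remains on the layer output.

```python
import numpy as np

def plain_svd_low_rank(delta_w, rank):
    """Baseline: rank-`rank` factors from a truncated SVD of the raw error."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

def eigenspace_low_rank(delta_w, x, rank, eps=1e-8):
    """Sketch of an activation-aware low-rank fit (assumed formulation).

    delta_w: (d_out, d_in) compression error, W - W_compressed
    x:       (d_in, n) calibration input activations
    Returns b (d_out, rank) and a (rank, d_in) with b @ a approximating delta_w.
    """
    # Eigendecomposition of the symmetric PSD matrix X X^T.
    eigvals, q = np.linalg.eigh(x @ x.T)
    # Clamp tiny eigenvalues so the projection stays invertible (assumption).
    scale = np.sqrt(np.clip(eigvals, eps, None))
    # Project the error into the eigenspace, weighting each direction
    # by the square root of its eigenvalue.
    projected = (delta_w @ q) * scale
    u, s, vt = np.linalg.svd(projected, full_matrices=False)
    b = u[:, :rank] * s[:rank]
    # Map the right factor back to the original input space.
    a = (vt[:rank] / scale) @ q.T
    return b, a

# Toy setup: anisotropic activations and a crude rounding "quantizer".
rng = np.random.default_rng(0)
d_out, d_in, n, rank = 16, 32, 256, 4
x = rng.normal(size=(d_in, n)) * np.linspace(0.1, 5.0, d_in)[:, None]
w = rng.normal(size=(d_out, d_in))
w_compressed = np.round(w * 2) / 2  # stand-in for a real compression scheme
delta_w = w - w_compressed

b_eig, a_eig = eigenspace_low_rank(delta_w, x, rank)
b_svd, a_svd = plain_svd_low_rank(delta_w, rank)

# Residual error on the layer output for each compensation path.
err_none = np.linalg.norm(delta_w @ x)
err_svd = np.linalg.norm((delta_w - b_svd @ a_svd) @ x)
err_eig = np.linalg.norm((delta_w - b_eig @ a_eig) @ x)
print(f"no compensation: {err_none:.3f}")
print(f"plain SVD:       {err_svd:.3f}")
print(f"eigenspace:      {err_eig:.3f}")
```

Under this formulation, the compensated layer would compute `w_compressed @ x + b @ (a @ x)`. When `x @ x.T` is full rank, the truncated SVD in the projected space is the best rank-`rank` fit of the output error, which is why directions with large activation eigenvalues are reconstructed first.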

Problem

Research questions and friction points this paper is trying to address.

Compensate compression errors in LLMs

Minimize errors without gradient-based training

Integrate with fine-tuning and quantization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free Eigenspace Low-Rank Approximation

Minimizes compression-induced errors

Projects errors into eigenspace

🔎 Similar Papers

Data-free Weight Compress and Denoise for Large Language Models