Lillama: Large Language Models Compression via Low-Rank Feature Distillation

📅 2024-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of post-pruning continued pretraining for large language models (LLMs), this paper proposes Lillama, a low-rank feature distillation framework that avoids retraining on billions of tokens. The method jointly exploits the low-rank structure of activations and a low-rank weight parameterization, combining SVD-based weight initialization, a teacher–student activation-matching loss, and localized gradient updates, enabling rapid compression with only a small calibration dataset. The approach is architecture-agnostic, supporting both Transformer- and Mamba-based models. Experiments demonstrate strong efficacy: Mixtral-8x7B is compressed by 10 billion parameters within minutes on a single A100 GPU while retaining over 95% of its original performance; Phi-2 (3B) achieves a 40% parameter reduction with only 13 million calibration tokens and competes with recent models of similar size; Mamba-3B retains 99% of its original performance after 20% compression. This work establishes a scalable, efficient, and hardware-friendly paradigm for LLM compression without reliance on costly retraining.
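The SVD-based initialization mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function name, shapes, and rank choice are assumptions. The idea is to replace a dense weight matrix `W` with two factors `A @ B` obtained by truncated SVD, which cuts the parameter count whenever `rank * (d_out + d_in) < d_out * d_in`.

```python
import numpy as np

def lowrank_init(W, rank):
    """Initialize low-rank factors A @ B ~= W via truncated SVD.

    Hypothetical helper illustrating SVD-based weight initialization;
    the paper's exact factorization scheme may differ.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))      # dense layer weight
A, B = lowrank_init(W, rank=64)
# Parameters drop from 256*512 = 131072 to 64*(256+512) = 49152
```

By the Eckart–Young theorem, this truncation is the best rank-64 approximation of `W` in Frobenius norm, which gives the student a strong starting point before distillation.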

📝 Abstract
Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first significantly impacts model accuracy. Prior research suggests pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this drop. Based on this observation, we propose Lillama, a compression method that locally distills activations with low-rank weights. Using SVD for initialization and a joint loss combining teacher and student activations, we accelerate convergence and reduce memory use with local gradient updates. Lillama compresses Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of its original performance. Phi-2 3B can be compressed by 40% with just 13 million calibration tokens, resulting in a small model that competes with recent models of similar size. The method generalizes well to non-transformer architectures, compressing Mamba-3B by 20% while maintaining 99% performance.
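The local distillation described in the abstract — matching a low-rank student layer's activations to the frozen teacher layer's activations, with gradients confined to that layer — can be sketched as below. This is an assumed mean-squared activation-matching loss with hand-derived gradients; the paper's joint loss and optimizer are not reproduced here, and all names are illustrative.

```python
import numpy as np

def local_distill_step(Wt, A, B, X, lr=1e-2):
    """One local gradient step pulling student activations toward the teacher's.

    Wt: frozen teacher weight (d_out, d_in); A, B: student low-rank factors;
    X: calibration batch (n, d_in). Returns updated factors and current loss.
    A sketch under an assumed MSE activation-matching loss.
    """
    Yt = X @ Wt.T            # teacher activations (n, d_out)
    Ys = X @ B.T @ A.T       # student activations (n, d_out)
    diff = Ys - Yt
    n = X.shape[0]
    # Gradients of mean((Ys - Yt)**2) w.r.t. A and B (local update only)
    gA = 2.0 / n * diff.T @ (X @ B.T)
    gB = 2.0 / n * (A.T @ diff.T) @ X
    return A - lr * gA, B - lr * gB, float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
Wt = rng.standard_normal((16, 32))
X = rng.standard_normal((64, 32))
A = 0.1 * rng.standard_normal((16, 8))
B = 0.1 * rng.standard_normal((8, 32))
_, _, loss0 = local_distill_step(Wt, A, B, X)
for _ in range(50):
    A, B, loss = local_distill_step(Wt, A, B, X)
```

Because each layer is matched locally, no gradients flow through the rest of the network, which is what keeps memory use low enough to compress large models on a single GPU.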
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Pruning
Accuracy Degradation
Non-Low-Rank Pretrained Weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

SVD Compression
Teacher-Student Activation Loss
Efficient Training