🤖 AI Summary
To address the substantial storage, training, and inference overheads caused by excessive layer depth in large language models (LLMs), this paper proposes a layer-pruning-based model slimming method that systematically investigates depth redundancy in LLMs. Unlike conventional compression paradigms, and contrary to the "deeper is better" assumption, we empirically find that aggressively reducing the number of layers (retaining only 25%–50%) not only preserves performance but yields average accuracy gains of 1.2%–3.7% on prompt-tuned text classification tasks; remarkably, some single-layer variants even outperform their full-depth baselines. The approach integrates layer importance estimation, prompt-based fine-tuning, and cross-layer performance attribution analysis. Evaluated across multiple benchmarks, it reduces inference GPU memory consumption by 40%–65%, establishing a new paradigm for efficient LLM deployment.
📝 Abstract
Large Language Models (LLMs) possess outstanding capabilities in addressing a wide range of natural language processing (NLP) tasks. However, the sheer size of these models, which stack billions of parameters across many layers, poses challenges for storage, training, and inference. While traditional approaches such as model pruning or distillation offer ways to reduce model size, they often come at the expense of performance. In our investigation, we systematically explore reducing the number of layers in LLMs. Surprisingly, we observe that even with far fewer layers, LLMs maintain similar or better performance, particularly under prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work on mitigating the size constraints of LLMs while preserving their performance, opening avenues for significantly more efficient use of LLMs.
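The depth reduction described above can be sketched minimally as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the slimmed model simply keeps the first `k` layers of the stack, whereas the paper's own selection is guided by layer importance estimation. The function `prune_layers` and its `keep_ratio` parameter are names introduced here for illustration.

```python
import math

def prune_layers(layers, keep_ratio=0.25):
    """Return a shallow copy of the model's layer stack, keeping only
    the first ceil(keep_ratio * n) layers (hypothetical criterion;
    the paper selects layers via importance estimation instead).
    keep_ratio=0.25..0.5 matches the 25%-50% retention range studied,
    and at least one layer is always kept (the single-layer variant)."""
    n_keep = max(1, math.ceil(keep_ratio * len(layers)))
    return layers[:n_keep]

# Example: a 32-layer stack pruned to 25% depth keeps 8 layers.
layer_stack = [f"layer_{i}" for i in range(32)]
slim_stack = prune_layers(layer_stack, keep_ratio=0.25)
```

In a transformer framework, `layers` would be the model's decoder-block list (e.g. a `torch.nn.ModuleList`), and the remaining blocks would then be prompt-tuned on the downstream classification task.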