How can representation dimension dominate structurally pruned LLMs?

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing structured pruning methods for large language models (LLMs) rely on empirical evaluation, lacking a principled understanding of how representational dimensionality governs post-pruning performance. Method: We propose the first theoretical framework that explicitly models the analytical relationship between representation dimension and model performance (perplexity/accuracy), enabling accurate performance prediction without inference. Our approach integrates SliceGPT-based structured pruning, residual stream analysis in Transformers, and decomposability modeling of linear transformations to quantify dimensional sensitivity. Results: Experiments on Llama-3-8B-Instruct and Phi-3-mini-4k-Instruct demonstrate that our theoretical perplexity predictions deviate from empirical measurements by less than 1.2%. This significantly enhances interpretability and design efficiency in pruning, establishing a new paradigm for efficient, controllable LLM compression.

📝 Abstract
Pruning assumes a subnetwork exists in the original deep neural network that can achieve comparable model performance with less computation than the original. However, it is unclear how model performance varies across different subnetwork extractions. In this paper, we choose the representation dimension (also called the embedding dimension, model dimension, or the dimension of the residual stream in the relevant literature) as the entry point to this issue. We investigate the linear transformations in the LLM transformer blocks and consider a specific structured pruning approach, SliceGPT, to extract subnetworks of different representation dimensions. We mechanistically analyse the activation flow during the model forward passes, and find that the representation dimension dominates the linear transformations, the model predictions, and, ultimately, the model performance. Explicit analytical relations are given to calculate pruned model performance (perplexity and accuracy) without actual evaluation, and are empirically validated with Llama-3-8B-Instruct and Phi-3-mini-4k-Instruct.
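The core operation the abstract describes, rotating a layer's weights into a principal-component basis of the activations and then slicing away trailing directions of the residual stream, can be sketched as follows. This is a minimal toy illustration of the idea, not the paper's implementation: the orthogonal basis `Q` here is random, whereas SliceGPT derives it from an eigendecomposition of activation covariances, and the function names are hypothetical.

```python
import numpy as np

def slice_representation(W: np.ndarray, Q: np.ndarray, d_keep: int) -> np.ndarray:
    """Rotate a weight matrix into an orthogonal basis, then keep only the
    first d_keep output directions, shrinking the representation dimension."""
    return (W @ Q)[:, :d_keep]

rng = np.random.default_rng(0)
d = 8                                    # original representation dimension
W = rng.standard_normal((d, d))          # toy linear-layer weight
# Stand-in orthogonal basis; SliceGPT would compute Q from activation statistics.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_sliced = slice_representation(W, Q, d_keep=6)
print(W_sliced.shape)  # (8, 6): the layer now writes a 6-dimensional residual stream
```

Because `Q` is orthogonal, the rotation itself is lossless; information is only discarded by the final slice, which is what ties performance to the retained dimension `d_keep`.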
Problem

Research questions and friction points this paper is trying to address.

Explores how representation dimension affects pruned LLM performance.
Analyzes linear transformations and activation flow in pruned subnetworks.
Provides analytical relations to predict pruned model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focuses on representation dimension as the governing factor in pruning.
Uses SliceGPT for structured pruning analysis.
Derives analytical relations that predict pruned model performance.