Locality-Aware Redundancy Pruning for LLM Depth Compression

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses representational redundancy along the depth dimension of large language models, which hinders inference efficiency. Existing one-shot pruning methods often rely on local layer-importance metrics or fixed assumptions about redundancy, limiting their adaptability across diverse architectures. To overcome this, the paper proposes LoRP (Locality-Aware Redundancy Pruning), a training-free, one-shot depth pruning framework built on the concept of representation locality. LoRP computes pairwise hidden-state similarity between layers, clusters layers by that similarity, and defines a Representation Locality Score (RLS) that characterizes whether redundancy is localized or globally distributed. Guided by RLS, it allocates pruning across clusters using only a small calibration set. Experiments show that LoRP consistently outperforms existing methods across multiple large language models, achieving substantial compression while preserving or even improving perplexity and downstream task accuracy.

📝 Abstract

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or on redundancy assumptions that are held fixed across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce the Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy.
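
The three steps named in the abstract (pairwise layer similarity, clustering, pruning allocation) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: the abstract does not give the similarity measure, the clustering algorithm, the RLS formula, or the allocation rule, so this sketch substitutes mean per-token cosine similarity, a greedy threshold clustering, and a "prune from the largest cluster first" rule. The function names `layer_similarity`, `cluster_layers`, and `allocate_pruning` are hypothetical, and RLS itself is not computed.

```python
import math

def cosine(u, v):
    # Cosine similarity of two vectors; assumes neither is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_similarity(hidden):
    # hidden[l][t] is the hidden-state vector of layer l for calibration token t.
    # Returns an L x L matrix of cosine similarity averaged over tokens
    # (an assumed stand-in for the paper's similarity measure).
    n = len(hidden)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            vals = [cosine(a, b) for a, b in zip(hidden[i], hidden[j])]
            sim[i][j] = sim[j][i] = sum(vals) / len(vals)
    return sim

def cluster_layers(sim, threshold):
    # Assumed greedy clustering: a layer joins the first cluster whose first
    # member it resembles closely enough. Clusters need not be contiguous,
    # so both localized and globally distributed redundancy can be grouped.
    clusters = []
    for layer in range(len(sim)):
        for cluster in clusters:
            if sim[cluster[0]][layer] >= threshold:
                cluster.append(layer)
                break
        else:
            clusters.append([layer])
    return clusters

def allocate_pruning(clusters, budget):
    # Assumed allocation rule: repeatedly drop the last layer of the largest
    # cluster, always keeping at least one layer per cluster.
    remaining = [list(c) for c in clusters]
    pruned = []
    for _ in range(budget):
        target = max(remaining, key=len)
        if len(target) <= 1:
            break  # no intra-cluster redundancy left to remove
        pruned.append(target.pop())
    return sorted(pruned)

# Toy calibration data: 4 layers, 2 tokens, hidden size 2.
hidden = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.1, 1.0], [1.0, 0.1]],
]
sim = layer_similarity(hidden)
clusters = cluster_layers(sim, threshold=0.9)
print(clusters)                       # [[0, 1], [2, 3]]
print(allocate_pruning(clusters, 2))  # [1, 3]
```

In this toy case each cluster holds one redundant layer, so a budget of two removes one layer from each cluster and any larger budget is left unspent. A locality-aware method would additionally use a score such as RLS to decide how pruning is distributed; that step is omitted because its definition is not stated here.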

Problem

Research questions and friction points this paper is trying to address.

depth pruning

representational redundancy

large language models

layer similarity

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

depth pruning

representation locality

redundancy pruning