ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

πŸ“… 2025-05-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Addressing the challenge that large language models (LLMs) struggle to achieve both high compression ratios and strong performance in training-free settings, this paper proposes ReplaceMe, a training-free, architecture-agnostic depth pruning method that adds no new parameters. ReplaceMe uses a small set of unlabeled calibration data to estimate linear transformations that replace redundant transformer layers, then merges those transformations into the weights of the remaining blocks, yielding an end-to-end training-free depth pruning pipeline. Applied to several mainstream LLMs, ReplaceMe removes up to 25% of layers while preserving approximately 90% of the original task performance, with negligible computational overhead. It significantly outperforms existing training-free pruning methods and matches state-of-the-art approaches that depend on retraining.

πŸ“ Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
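The core idea of estimating a linear replacement from calibration data can be sketched with a toy least-squares fit. This is an illustrative assumption about the estimation step, not the authors' exact procedure; all names and the use of `numpy.linalg.lstsq` are ours.

```python
import numpy as np

# Toy sketch: fit a linear map T that approximates the response of a
# pruned span of transformer blocks. H_in holds hidden states entering
# the span, H_out the hidden states the span would have produced, both
# collected from a small calibration set (here, synthetic data).
rng = np.random.default_rng(0)
d = 8            # hidden size (toy scale)
n = 256          # number of calibration tokens
H_in = rng.normal(size=(n, d))
T_true = rng.normal(size=(d, d))
H_out = H_in @ T_true        # stand-in for the pruned blocks' output

# Least-squares estimate: T = argmin_T ||H_in T - H_out||_F
T, *_ = np.linalg.lstsq(H_in, H_out, rcond=None)

# Relative error of the fitted map on the calibration data.
err = np.linalg.norm(H_in @ T - H_out) / np.linalg.norm(H_out)
```

On this synthetic data the fit is exact up to numerical precision; on real activations the residual measures how well a single linear map can stand in for the pruned blocks.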
Problem

Research questions and friction points this paper is trying to address.

How to simplify a network by replacing transformer blocks with linear operations
How to maintain high performance without any retraining or fine-tuning
How to prune up to 25% of layers with minimal computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free depth pruning via linear transformations
Calibration dataset estimates linear mapping
Seamlessly merges pruned blocks without parameters
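The "no additional parameters" claim follows from linearity: a linear replacement composes with the next remaining layer's projection, so it can be folded into that weight matrix. A minimal sketch, with hypothetical names and a row-vector convention assumed by us:

```python
import numpy as np

# Sketch of the merge step: instead of storing the estimated map T as a
# separate layer, fold it into the first linear projection W of the next
# remaining transformer block.
rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=(1, d))   # hidden state (row-vector convention)
T = rng.normal(size=(d, d))   # estimated replacement for pruned blocks
W = rng.normal(size=(d, d))   # next kept layer's input projection

# Applying T and then W in two steps ...
y_two_step = (h @ T) @ W

# ... is identical to one projection with the merged weight T @ W,
# so the pruned model keeps its original layer structure and size.
W_merged = T @ W
y_merged = h @ W_merged
```

Because matrix multiplication is associative, the merged model is exactly equivalent to applying the replacement map followed by the original projection.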
Authors

Dmitriy Shopkhoev (MTS AI, ITMO University)
Ammar Ali (MTS AI, ITMO University)
Magauiya Zhussip (MTS AI)
Valentin Malykh (MTS AI, ITMO University)
Stamatios Lefkimmiatis (MTS AI)
N. Komodakis (University of Crete, IACM-Forth, Archimedes Athena RC)
Sergey Zagoruyko (polynome.ai)