🤖 AI Summary
To address the high parameter count and deployment cost of large language model (LLM)-based recommender systems, this paper proposes a three-stage progressive structured pruning method that is, to the authors' knowledge, the first to systematically identify and exploit both intra-module redundancy (within self-attention and MLP layers) and inter-layer redundancy. The approach combines fine-grained intra-module pruning, staged width-to-depth inter-layer pruning, and knowledge distillation to mitigate performance degradation. Evaluated on three public benchmark datasets, it prunes over 95% of non-embedding parameters while retaining, on average, 88% of the original model's recommendation accuracy, substantially improving parameter efficiency and deployment feasibility. Key contributions are: (1) uncovering and modeling fine-grained intra-component redundancy in LLM recommenders; and (2) establishing a progressive pruning paradigm that jointly ensures structural deployability and performance stability.
📝 Abstract
LLM-based recommender systems have made significant progress; however, the deployment cost associated with the large parameter counts of LLMs still hinders their real-world application. This work explores parameter pruning to improve parameter efficiency while maintaining recommendation quality, thereby enabling easier deployment. Unlike existing approaches that focus primarily on inter-layer redundancy, we uncover intra-layer redundancy within components such as the self-attention and MLP modules. Building on this analysis, we propose a more fine-grained pruning approach that integrates both intra-layer and layer-wise pruning. Specifically, we introduce a three-stage strategy that progressively prunes parameters at different levels and parts of the model, moving from intra-layer to layer-wise pruning, i.e., from width to depth. Each stage also includes a performance-restoration step based on distillation, which helps balance performance against parameter efficiency. Empirical results demonstrate the effectiveness of our approach: across three datasets, our models retain an average of 88% of the original model's performance while pruning more than 95% of the non-embedding parameters. This underscores the potential of our method to significantly reduce resource requirements without greatly compromising recommendation quality. Our code will be available at: https://github.com/zheng-sl/PruneRec
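The width-to-depth schedule described above can be sketched in miniature. The following is a hedged illustration only, not the paper's implementation: layers are toy lists of unit weights, the importance score and the keep ratios (`width_keep`, `depth_keep`) are hypothetical, and the distillation step is a no-op placeholder for the actual recovery training.

```python
# Illustrative sketch of progressive width-to-depth structured pruning
# with a distillation step after each stage. All names, scores, and
# ratios here are hypothetical, not the paper's code.

def importance(units):
    """Toy importance score: magnitude of each unit's weight."""
    return [abs(w) for w in units]

def prune_width(layer, keep_ratio):
    """Intra-layer (width) pruning: drop the least important units
    inside a layer (e.g. attention heads or MLP neurons)."""
    scores = importance(layer)
    k = max(1, int(len(layer) * keep_ratio))
    keep = sorted(range(len(layer)), key=lambda i: -scores[i])[:k]
    return [layer[i] for i in sorted(keep)]

def prune_depth(layers, keep_ratio):
    """Inter-layer (depth) pruning: drop whole layers with the
    lowest aggregate importance."""
    scores = [sum(importance(layer)) for layer in layers]
    k = max(1, int(len(layers) * keep_ratio))
    keep = sorted(range(len(layers)), key=lambda i: -scores[i])[:k]
    return [layers[i] for i in sorted(keep)]

def distill(student, teacher):
    """Placeholder for performance restoration: in practice the pruned
    student would be trained to match the teacher's outputs."""
    return student  # no-op in this sketch

def progressive_prune(model, width_keep=0.5, depth_keep=0.5):
    teacher = model
    # Stage 1: fine-grained intra-module pruning (width), then recover.
    student = [prune_width(layer, width_keep) for layer in model]
    student = distill(student, teacher)
    # Stage 2: inter-layer pruning (depth), then recover.
    student = prune_depth(student, depth_keep)
    student = distill(student, teacher)
    return student

model = [[0.9, -0.1, 0.4, 0.02],
         [0.05, 0.01, 0.03, 0.02],
         [1.2, 0.8, -0.7, 0.6]]
pruned = progressive_prune(model)  # 12 toy parameters reduced to 2
```

The point of the staging is that each pruning decision is made on an already-recovered model, so the depth stage sees a width-pruned student whose performance has been partially restored rather than a freshly damaged one.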