🤖 AI Summary
To address the high parameter count and deployment cost of large language model (LLM)-based recommender systems, this paper proposes a three-stage progressive structured pruning method that is, to the authors' knowledge, the first to systematically identify and exploit both intra-module redundancy (within self-attention and MLP layers) and inter-layer redundancy. The approach combines fine-grained intra-module pruning, staged width-to-depth inter-layer pruning, and knowledge distillation to mitigate performance degradation. Evaluated on three public benchmark datasets, it prunes over 95% of non-embedding parameters while retaining, on average, 88% of the original model's recommendation accuracy, substantially improving parameter efficiency and deployment feasibility. Key contributions are: (1) uncovering and modeling fine-grained intra-component redundancy in LLM recommenders; and (2) establishing a progressive pruning paradigm that jointly ensures structural deployability and performance stability.
📝 Abstract
LLM-based recommender systems have made significant progress; however, the deployment cost associated with the large parameter counts of LLMs still hinders their real-world application. This work explores parameter pruning to improve parameter efficiency while maintaining recommendation quality, thereby enabling easier deployment. Unlike existing approaches that focus primarily on inter-layer redundancy, we uncover intra-layer redundancy within components such as the self-attention and MLP modules. Building on this analysis, we propose a more fine-grained pruning approach that integrates both intra-layer and layer-wise pruning. Specifically, we introduce a three-stage strategy that progressively prunes parameters at different levels and parts of the model, moving from intra-layer to layer-wise pruning, i.e., from width to depth. Each stage also includes a performance-restoration step based on distillation, which helps balance performance against parameter efficiency. Empirical results demonstrate the effectiveness of our approach: across three datasets, our models retain an average of 88% of the original model's performance while pruning more than 95% of the non-embedding parameters. This underscores the potential of our method to significantly reduce resource requirements without greatly compromising recommendation quality. Our code will be available at: https://github.com/zheng-sl/PruneRec
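The width-to-depth schedule described above can be sketched in miniature. The following is a hedged illustration only, not the paper's implementation: layers are toy lists of unit weights, the importance score and the keep ratios (`width_keep`, `depth_keep`) are hypothetical, and the distillation step is a no-op placeholder for the actual recovery training.

```python
# Illustrative sketch of progressive width-to-depth structured pruning
# with a distillation step after each stage. All names, scores, and
# ratios here are hypothetical, not the paper's code.

def importance(units):
    """Toy importance score: magnitude of each unit's weight."""
    return [abs(w) for w in units]

def prune_width(layer, keep_ratio):
    """Intra-layer (width) pruning: drop the least important units
    inside a layer (e.g. attention heads or MLP neurons)."""
    scores = importance(layer)
    k = max(1, int(len(layer) * keep_ratio))
    keep = sorted(range(len(layer)), key=lambda i: -scores[i])[:k]
    return [layer[i] for i in sorted(keep)]

def prune_depth(layers, keep_ratio):
    """Inter-layer (depth) pruning: drop whole layers with the
    lowest aggregate importance."""
    scores = [sum(importance(layer)) for layer in layers]
    k = max(1, int(len(layers) * keep_ratio))
    keep = sorted(range(len(layers)), key=lambda i: -scores[i])[:k]
    return [layers[i] for i in sorted(keep)]

def distill(student, teacher):
    """Placeholder for performance restoration: in practice the pruned
    student would be trained to match the teacher's outputs."""
    return student  # no-op in this sketch

def progressive_prune(model, width_keep=0.5, depth_keep=0.5):
    teacher = model
    # Stage 1: fine-grained intra-module pruning (width), then recover.
    student = [prune_width(layer, width_keep) for layer in model]
    student = distill(student, teacher)
    # Stage 2: inter-layer pruning (depth), then recover.
    student = prune_depth(student, depth_keep)
    student = distill(student, teacher)
    return student

model = [[0.9, -0.1, 0.4, 0.02],
         [0.05, 0.01, 0.03, 0.02],
         [1.2, 0.8, -0.7, 0.6]]
pruned = progressive_prune(model)  # 12 toy parameters reduced to 2
```

The point of the staging is that each pruning decision is made on an already-recovered model, so the depth stage sees a width-pruned student whose performance has been partially restored rather than a freshly damaged one.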