P$^2$ Law: Scaling Law for Post-Training After Model Pruning

📅 2024-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of quantitative guidance for selecting the fine-tuning data volume after large language model (LLM) pruning. We propose and empirically validate the first quantitative scaling law for post-pruning fine-tuning, termed the P² Law, which jointly characterizes how model size, pruning ratio, pre-pruning loss, and fine-tuning token count determine post-fine-tuning loss. Leveraging extensive post-training experiments across the Llama-3 and Qwen-2.5 families (7B–72B), both structured and unstructured pruning, and over 10 billion tokens, we derive the law via regression modeling and scaling analysis. The resulting framework generalizes across model scales, pruning ratios, and data volumes. By predicting the point of diminishing returns, the P² Law reduces wasteful data investment while maintaining high prediction accuracy even under aggressive settings—e.g., >50% pruning, 72B models, and ≥10B fine-tuning tokens—replacing trial-and-error tuning of the post-training data budget.

📝 Abstract
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. While post-training benefits from larger datasets, once the dataset size is already substantial, increasing the training data provides only limited performance gains. To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling \textbf{Law} for \textbf{P}ost-training after model \textbf{P}runing, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model's post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model's loss before pruning. Moreover, the P$^2$ Law can generalize to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.
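The abstract names the law's four predictors but this page does not give its functional form or fitted constants. As an illustration only, a Chinchilla-style power-law ansatz over those four factors, with invented coefficients (not the paper's fitted P$^2$ Law), might look like:

```python
def p2_loss_ansatz(n0: float, d: float, rho: float, l0: float) -> float:
    """Illustrative power-law ansatz for post-pruning fine-tuning loss.

    n0  -- parameter count before pruning
    d   -- number of post-training tokens
    rho -- pruning rate in [0, 1)
    l0  -- model loss before pruning

    All coefficients below are invented for illustration; they are NOT
    the paper's fitted constants.
    """
    a, alpha, beta = 5.0, 1.2, 0.25   # pruning-damage term, eased by model scale
    b, gamma = 400.0, 0.3             # data-recovery term, shrinks with more tokens
    return l0 + a * rho**alpha / n0**beta + b / d**gamma

# Diminishing returns: each extra slice of tokens recovers less loss.
losses = [p2_loss_ansatz(8e9, d, 0.5, 2.0) for d in (1e9, 5e9, 1e10)]
```

The additive structure mirrors the qualitative claims above: loss starts from the pre-pruning loss, rises with the pruning rate, and is recovered by post-training tokens with a decaying power-law return.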
Problem

Research questions and friction points this paper is trying to address.

Determine optimal post-training data size for pruned LLMs
Identify key factors affecting pruned model's post-training loss
Generalize scaling law to larger datasets and higher pruning rates
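The first research question, choosing a post-training data budget, can be sketched as a stopping rule: given any fitted loss predictor, stop adding tokens once the predicted marginal loss reduction per step falls below a threshold. The predictor and all constants below are placeholders for illustration, not the paper's fitted law:

```python
def predicted_loss(d_tokens: float) -> float:
    """Placeholder loss predictor: a constant floor plus a power-law data term.

    The constants are invented for illustration and are not the fitted
    P^2 Law coefficients.
    """
    l_floor, b, gamma = 2.15, 30.0, 0.3
    return l_floor + b / d_tokens**gamma

def smallest_useful_budget(step: float = 1e9, max_tokens: float = 2e10,
                           min_gain: float = 1e-3) -> float:
    """Smallest token budget (in increments of `step`) at which the predicted
    loss improvement from one more step drops below `min_gain`."""
    d = step
    while d + step <= max_tokens:
        if predicted_loss(d) - predicted_loss(d + step) < min_gain:
            return d
        d += step
    return max_tokens  # marginal gains never fell below the threshold

budget = smallest_useful_budget()
```

Tightening `min_gain` pushes the budget toward `max_tokens`; loosening it stops training earlier, which is exactly the cost-performance trade-off the Problem section describes.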
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies P$^2$ Law for post-training pruned models
Optimizes post-training data for cost-performance balance
Generalizes to larger datasets, models, and pruning rates
Authors
Xiaodong Chen
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Yuxuan Hu
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Xiaokang Zhang
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Yanling Wang
Zhipu AI (Data Mining; Natural Language Processing)
Cuiping Li
Renmin University of China (Database; Big Data Analysis and Mining)
Hong Chen
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Jing Zhang
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China