Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To improve the merging of multiple LoRA checkpoints in parameter-efficient fine-tuning (PEFT), this paper proposes Metrics-Weighted Averaging (MWA): the first method to explicitly use training loss or step count as a dynamic weighting criterion, with an adjustable penalty factor to shape the weight distribution—requiring only a single hyperparameter and no additional training. Evaluated on mathematical reasoning, preference alignment, and instruction-following tasks, MWA consistently outperforms the naive uniform average of checkpoints, with loss-weighted fusion achieving up to a 5% absolute accuracy gain over the uniform baseline and even surpassing the final individual checkpoint. Computational overhead is negligible. The core innovation lies in directly leveraging performance metrics to drive weight design, breaking from conventional uniform or empirically tuned weighting schemes. Within the PEFT paradigm, MWA enables efficient, lightweight, plug-and-play model ensembling.

📝 Abstract
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model, potentially reducing training time for large language models. This paper explores checkpoint merging in the context of parameter-efficient fine-tuning (PEFT), where only small adapter modules (e.g. LoRA) are trained. We propose Metrics-Weighted Averaging (MWA), a simple yet effective method to merge model checkpoints by weighting their parameters according to performance metrics. In particular, we investigate weighting by training loss and by training steps, under the intuition that lower-loss or later-step checkpoints are more valuable. We introduce a formula with a penalty factor to adjust weight distribution, requiring only one hyperparameter regardless of the number of checkpoints. Experiments on three fine-tuning tasks (mathematical reasoning, preference alignment, and general instruction tuning) show that MWA consistently produces merged models that outperform the naive uniform average of checkpoints. Notably, loss-weighted merging often yields the best results, delivering up to 5% higher task accuracy than the baseline uniform merge and even surpassing the final individual checkpoint's performance. These findings validate checkpoint merging for PEFT and demonstrate that a metric-driven weighting heuristic can efficiently boost model performance with minimal computational overhead.
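The abstract does not reproduce the weighting formula, only that checkpoints are averaged with weights derived from a metric (training loss or step count) and a single penalty hyperparameter. As a rough sketch of the idea—not the paper's exact formula—one plausible loss-weighted instantiation is a normalized exponential over negated losses, where the hypothetical `gamma` plays the role of the penalty factor:

```python
import math

def mwa_weights(losses, gamma=1.0):
    """Hypothetical loss-based weighting: lower-loss checkpoints receive
    larger weights; gamma is the single penalty hyperparameter that
    sharpens (large gamma) or flattens (small gamma) the distribution.
    This is an illustrative guess at the scheme, not the paper's formula."""
    scores = [math.exp(-gamma * loss) for loss in losses]
    total = sum(scores)
    return [s / total for s in scores]

def merge_checkpoints(checkpoints, weights):
    """Element-wise weighted average of adapter parameters.
    Checkpoints are modeled as dicts of float lists for simplicity;
    real LoRA adapters would hold tensors keyed by parameter name."""
    merged = {}
    for key in checkpoints[0]:
        merged[key] = [
            sum(w * ckpt[key][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][key]))
        ]
    return merged
```

With `gamma = 0` this degenerates to the uniform-average baseline, which matches the abstract's framing of uniform merging as the special case MWA improves upon.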
Problem

Research questions and friction points this paper is trying to address.

Efficiently merging PEFT checkpoints via weighted averaging
Improving model performance using metric-driven merging
Reducing training overhead while enhancing task accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metrics-Weighted Averaging for checkpoint merging
Weighting by training loss and steps
Single hyperparameter with penalty factor
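The second weighting criterion listed above—training steps—could be sketched analogously, under the intuition that later-step checkpoints are more valuable. The power form and the name `gamma` below are assumptions for illustration, not the paper's stated formula:

```python
def step_weights(steps, gamma=1.0):
    """Hypothetical step-based weighting: checkpoints saved at larger
    step counts receive larger weights; gamma is again the single
    penalty hyperparameter (gamma = 0 recovers the uniform average)."""
    scores = [s ** gamma for s in steps]
    total = sum(scores)
    return [sc / total for sc in scores]
```

Note that both criteria share the one-hyperparameter design highlighted in the summary: the number of checkpoints never adds tuning burden.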