A Systematic Study of Model Merging Techniques in Large Language Models

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: The efficacy of existing model merging techniques—well-established for small-scale models or classifiers—remains unclear and largely unvalidated for large language models (LLMs). Method: We conduct a systematic empirical evaluation of six state-of-the-art merging methods—including Task Arithmetic, subspace fusion, and interference-aware fusion—across four open-source LLMs, multiple fine-tuned checkpoints, and standard benchmarks, assessing both performance and stability. Results: Only Task Arithmetic consistently improves performance across diverse tasks; all other methods induce significant degradation, revealing widespread failure of current merging techniques in the LLM regime. To our knowledge, this is the first study to empirically demonstrate that LLM merging requires purpose-built algorithms, and we advocate for “merging-aware” fine-tuning paradigms. Our findings provide critical empirical evidence and concrete direction for future research on LLM model merging.

📝 Abstract
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
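The abstract describes two evaluation metrics: the probability that a merged model outperforms the base model, and the relative gain over the best individual checkpoint. A minimal sketch of how such metrics might be computed over per-benchmark scores, assuming simple definitions (the paper's exact formulations may differ; the function names and toy scores here are illustrative):

```python
import numpy as np

def merge_win_rate(merged_scores, base_scores):
    """Fraction of benchmarks on which the merged model beats the base model."""
    merged = np.asarray(merged_scores)
    base = np.asarray(base_scores)
    return float(np.mean(merged > base))

def relative_gain(merged_scores, checkpoint_best_scores):
    """Relative gain of the merged model's mean score over the best
    single fine-tuned checkpoint's mean score."""
    best = float(np.max(np.mean(np.asarray(checkpoint_best_scores), axis=1)))
    return (float(np.mean(merged_scores)) - best) / best

# Toy scores on three benchmarks (hypothetical numbers, not from the paper)
merged = [0.72, 0.65, 0.81]
base = [0.70, 0.68, 0.75]
checkpoints = [[0.71, 0.60, 0.78], [0.66, 0.70, 0.74]]  # two checkpoints

win = merge_win_rate(merged, base)        # 2 of 3 benchmarks improve
gain = relative_gain(merged, checkpoints)
```

Averaging across benchmarks before comparing checkpoints is one design choice among several; per-task comparisons would penalize merging methods that trade gains on one task for losses on another.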
Problem

Research questions and friction points this paper is trying to address.

Evaluates model merging techniques for large language models
Tests if merging benefits transfer from small to large models
Identifies Task Arithmetic as the only reliable merging method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model merging combines fine-tuned checkpoints without training
Task Arithmetic reliably yields performance gains on LLMs
Current merging techniques do not transfer to modern LLMs
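The method the study singles out, Task Arithmetic, merges checkpoints by adding scaled "task vectors" (fine-tuned weights minus base weights) back onto the base model. A minimal sketch of the idea over NumPy weight dictionaries, assuming a single scaling coefficient shared across checkpoints (the paper may tune this differently; all names and values below are illustrative):

```python
import numpy as np

def task_arithmetic_merge(base, finetuned, lam=0.3):
    """Merge fine-tuned checkpoints into a base model via Task Arithmetic.

    base:      dict mapping parameter names to np.ndarray (base weights)
    finetuned: list of dicts with the same keys (fine-tuned weights)
    lam:       scaling coefficient applied to the summed task vectors
    """
    merged = {}
    for name, w_base in base.items():
        # Task vector = fine-tuned weights minus base weights
        task_vectors = [ft[name] - w_base for ft in finetuned]
        merged[name] = w_base + lam * sum(task_vectors)
    return merged

# Toy example: two "checkpoints" over a single 2-parameter layer
base = {"w": np.array([1.0, 2.0])}
ft_a = {"w": np.array([1.5, 2.0])}  # moved the first weight by +0.5
ft_b = {"w": np.array([1.0, 3.0])}  # moved the second weight by +1.0
merged = task_arithmetic_merge(base, [ft_a, ft_b], lam=1.0)
# merged["w"] is [1.5, 3.0]: both task vectors applied to the base
```

No additional training is involved, which is what makes merging attractive: the cost is a single pass over the parameter tensors.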