MergeBench: A Benchmark for Merging Domain-Specialized LLMs

📅 2025-05-16
🤖 AI Summary
Merging domain-specialized large language models (LLMs) lacks systematic evaluation at scale. Method: We introduce MergeBench, the first standardized merging benchmark built on 2B–9B Llama and Gemma models, covering instruction following, mathematics, multilingual understanding, code generation, and safety. We systematically evaluate eight representative merging methods along three axes: multi-task performance, knowledge retention (i.e., forgetting), and runtime efficiency. Contribution/Results: We contribute a unified fine-tuning and evaluation protocol together with a multidimensional analysis covering accuracy, forgetting, latency, and GPU memory usage. Our experiments show that base-model strength, merging-coefficient tuning, and sparsification critically govern knowledge preservation, and that stronger base models consistently yield better merging outcomes, enabling practical guidelines for method selection. All code and evaluation tooling are open-sourced to facilitate integrating model merging into mainstream LLM training pipelines.
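The merging methods evaluated here operate on weight deltas ("task vectors") between each specialized checkpoint and the shared base model. As a minimal sketch of the parameter-arithmetic idea, task arithmetic specifically (function names and checkpoint paths below are hypothetical illustrations, not MergeBench's API):

```python
import torch

def task_arithmetic_merge(base_state, expert_states, coeff=0.3):
    """Merge specialized checkpoints by adding scaled task vectors.

    A task vector is the parameter delta between a finetuned expert
    and the shared base model; task arithmetic sums these deltas and
    adds a scaled copy back onto the base weights.
    """
    merged = {}
    for name, base_param in base_state.items():
        delta = sum(expert[name] - base_param for expert in expert_states)
        merged[name] = base_param + coeff * delta
    return merged

# Hypothetical usage with PyTorch state dicts (paths are placeholders):
# base = torch.load("base.pt")
# experts = [torch.load(f"{d}.pt") for d in ("math", "code", "safety")]
# merged = task_arithmetic_merge(base, experts, coeff=0.3)
```

The scalar `coeff` here is the merging coefficient whose tuning the summary identifies as critical for knowledge retention.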

📝 Abstract
Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To address these challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding, and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting, and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap in in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. We open source our code at https://github.com/uiuctml/MergeBench.
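One of the retention-improving techniques the abstract mentions is sparsification. A minimal sketch of magnitude-based top-k sparsification of a task vector, in the spirit of methods such as TIES-Merging (the function name and default density are illustrative assumptions, not the paper's exact procedure):

```python
import torch

def sparsify_task_vector(delta: torch.Tensor, density: float = 0.2) -> torch.Tensor:
    """Zero out all but the largest-magnitude entries of a task vector.

    Dropping small deltas reduces interference between experts when
    their task vectors are later summed, which is one mechanism linked
    to better knowledge retention after merging.
    """
    flat = delta.flatten()
    k = max(1, int(density * flat.numel()))
    threshold = flat.abs().topk(k).values.min()  # k-th largest magnitude
    return delta * (delta.abs() >= threshold)
```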
Problem

Research questions and friction points this paper is trying to address.

Evaluating model merging for large domain-specialized LLMs
Assessing merging methods across diverse tasks and scales
Addressing computational cost and performance gaps in merging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for merging domain-specialized LLMs
Standardized finetuning and evaluation protocols
Comprehensive assessment of eight representative merging methods
👥 Authors
Yifei He, University of Illinois Urbana-Champaign
Siqi Zeng, University of Illinois Urbana-Champaign (PhD, machine learning)
Yuzheng Hu, University of Illinois Urbana-Champaign
Rui Yang, University of Illinois Urbana-Champaign
Tong Zhang, University of Illinois Urbana-Champaign
Han Zhao, University of Illinois Urbana-Champaign