AI Summary
To address the significant loss of safety behavior observed when large language models are specialized into domain experts, this paper proposes MergeAlign, a novel method that applies model merging to jointly optimize safety and domain utility. MergeAlign combines domain-specific vectors with alignment vectors through an interpretable vector interpolation mechanism, enabling fine-grained trade-offs between safety and performance. The approach is evaluated on Llama3-based fine-tuned variants for medicine and finance, and the study draws on parameter-efficient alignment, model similarity analysis, and a decomposition of each merged model's contribution. Experiments demonstrate a 32% improvement in harmful response interception rate, with no measurable degradation in domain task performance (±0.2% fluctuation). The core contribution lies in bringing the model merging paradigm to safety alignment, so that domain knowledge fidelity and robustness are handled within a single framework.
Abstract
There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
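The interpolation of domain and alignment vectors can be illustrated with a small sketch. It is not the paper's implementation. It assumes that each vector is the element-wise difference between a fine-tuned model and a shared base model, and that a single coefficient `lam` weights the two vectors. The names `task_vector`, `interpolate_vectors`, and `lam` are hypothetical, and plain Python lists stand in for weight tensors.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference between a fine-tuned model and its base model."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def interpolate_vectors(
    base: Weights,
    domain_model: Weights,
    aligned_model: Weights,
    lam: float = 0.5,
) -> Weights:
    """Add a convex combination of the domain and alignment vectors to the base.

    Assumption (not confirmed by the source): lam = 1.0 recovers the domain
    model and lam = 0.0 recovers the aligned model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    domain_vec = task_vector(domain_model, base)
    align_vec = task_vector(aligned_model, base)
    return {
        name: [
            b + lam * d + (1.0 - lam) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
        for name in base
    }


# Hypothetical two-parameter "models" used only to exercise the functions.
base = {"w": [0.0, 1.0]}
domain_model = {"w": [1.0, 1.0]}
aligned_model = {"w": [0.0, 3.0]}

merged = interpolate_vectors(base, domain_model, aligned_model, lam=0.5)
```

Under these assumptions, sweeping `lam` between 0 and 1 moves the merged weights from the aligned model toward the domain model, which is one simple way to picture a trade-off between safety and domain utility.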