Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

📅 2024-11-11

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 0


🤖 AI Summary

To address the significant loss of safety observed when large language models are specialized into domain experts, this paper proposes MergeAlign, a method that applies model merging to jointly optimize safety and domain utility. MergeAlign combines domain-specific vectors with alignment vectors through an interpretable vector interpolation mechanism, enabling fine-grained trade-offs between safety and performance. The approach is evaluated on Llama3-based fine-tuned variants for medicine and finance, and the analysis covers parameter-efficient alignment, model similarity, and the contribution of each merged model. Experiments demonstrate a 32% improvement in harmful response interception rate, with zero degradation in domain task performance (±0.2% fluctuation). The core contribution is bringing the model merging paradigm into safety alignment, so that knowledge fidelity and robustness are handled within a single modeling framework.

📝 Abstract

There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
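The interpolation of domain and alignment vectors described in the abstract can be illustrated with a minimal sketch. It assumes that each vector is the parameter-wise difference between a fine-tuned model and a shared base model, and that the merge is a simple convex combination controlled by a weight `lam`. The abstract does not spell out the exact formulation, so the function names, the `lam` parameter, and the toy dictionary-of-lists model representation are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Parameter-wise difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def interpolate_merge(base: Params, domain: Params, aligned: Params,
                      lam: float) -> Params:
    """Return base + lam * domain_vector + (1 - lam) * alignment_vector.

    Assumption: a convex combination of the two vectors. lam = 1.0 recovers
    the domain expert, lam = 0.0 recovers the aligned model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    domain_vec = task_vector(domain, base)
    align_vec = task_vector(aligned, base)
    return {
        name: [
            b + lam * d + (1.0 - lam) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    # Hypothetical two-weight "models" used only to exercise the functions.
    base = {"w": [0.0, 1.0]}
    domain = {"w": [2.0, 1.0]}
    aligned = {"w": [0.0, 3.0]}
    print(interpolate_merge(base, domain, aligned, lam=0.5))
```

Sweeping `lam` between 0 and 1 is one way to trace the kind of knowledge-safety trade-off the title refers to. With real checkpoints, the same arithmetic would be applied tensor by tensor to models that share an architecture and a base.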

Problem

Research questions and friction points this paper is trying to address.

Balancing domain expertise and safety in specialized LLMs

Preventing harmful content generation in domain-expert models

Merging domain and alignment vectors for safer LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

MergeAlign merges domain and alignment vectors

Improves safety without losing domain utility

Applied on Llama3 medicine and finance variants

🔎 Similar Papers

Cross-Modal Safety Alignment: Is textual unlearning all you need?