🤖 AI Summary
Standard linear-weighting fusion methods for large language models (LLMs) can leave the training loss unchanged while silently degrading safety alignment, because they ignore how alignment is encoded in parameter geometry. Method: We propose AlignMerge, the first geometric fusion framework that explicitly preserves safety alignment. It models alignment as an invariant on the Fisher–Rao manifold, introduces the decoding-invariant Alignment Quality Index (AQI) as a latent-space alignment criterion, and jointly optimizes Fisher-information-guided local geometric modeling, alignment-subspace projection (P_A), and AQI-driven regularization, with a soft alignment budget constraint that balances safety against capability. Results: Evaluated across five major LLM families, our method fuses safety anchors with task-specific experts and achieves significant gains in AQI, toxicity suppression rate, and LLM-as-a-judge alignment scores, while simultaneously improving instruction following, reasoning, and helpfulness, with no capability trade-off.
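As a rough, hypothetical illustration of what a latent-space alignment criterion of this kind can look like (the function `aqi_score` and its pooled hidden-state inputs are assumptions, not the paper's AQI definition), one can score how cleanly the model's representations of safe and unsafe behavior separate:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def aqi_score(h_safe: np.ndarray, h_unsafe: np.ndarray) -> float:
    """Illustrative AQI-style score: how cleanly hidden states of aligned
    (safe) and misaligned (unsafe) behavior separate in representation space.

    h_safe, h_unsafe: (n_samples, hidden_dim) pooled hidden states collected
    from the model on safe and unsafe prompts, respectively.
    """
    X = np.vstack([h_safe, h_unsafe])
    labels = np.array([0] * len(h_safe) + [1] * len(h_unsafe))
    # Silhouette in [-1, 1]: computed from representations only, never from
    # sampled text, matching the decoding-invariant requirement.
    return float(silhouette_score(X, labels))
```

Because such a score is computed from hidden states rather than generated text, it stays stable under different decoding strategies, which is what makes a criterion of this shape usable as an optimization target during merging.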
📝 Abstract
Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc.
We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize:
L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud,
where L_geo keeps the merge close to its experts in Fisher–Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. For the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space.
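A minimal sketch of how these three terms could be combined, assuming a diagonal Fisher approximation, flattened parameter vectors, and an explicit orthogonal projector for P_A (the names `alignmerge_loss`, `fisher_diag`, and `aqi_fn` are illustrative, not the paper's implementation):

```python
import numpy as np

def alignmerge_loss(theta, theta_base, experts, fisher_diag, P_A,
                    aqi_fn, aqi_budget, lam_align=1.0, lam_bud=1.0):
    """Sketch of L_AlignMerge = L_geo + lam_align * L_align + lam_bud * L_bud.

    theta:       flattened merged weights being optimized
    theta_base:  aligned anchor (instruction-tuned base)
    experts:     list of flattened expert checkpoints
    fisher_diag: diagonal Fisher estimate (local Fisher-Rao metric)
    P_A:         orthogonal projector onto the estimated alignment subspace
    aqi_fn:      alignment functional, e.g. an AQI estimate for theta
    aqi_budget:  minimum alignment score the merge should softly retain
    """
    # L_geo: stay close to each expert under the local Fisher-Rao metric
    # (diagonal approximation of the Fisher information).
    L_geo = sum(np.sum(fisher_diag * (theta - th_e) ** 2) for th_e in experts)

    # L_align: penalize drift from the aligned anchor along
    # alignment-sensitive directions, i.e. the P_A-component of the motion.
    # For an orthogonal projector, drift @ P_A @ drift == ||P_A drift||^2.
    drift = theta - theta_base
    L_align = float(drift @ P_A @ drift)

    # L_bud: soft alignment budget; the hinge activates only when the
    # alignment functional falls below the allowed budget.
    L_bud = max(0.0, aqi_budget - aqi_fn(theta)) ** 2

    return L_geo + lam_align * L_align + lam_bud * L_bud
```

The hinge is what makes the budget soft rather than hard: the merge moves freely as long as the alignment functional stays above the budget, and the penalty grows quadratically once it falls below.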
Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), when merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-as-a-judge alignment) while matching or exceeding the best expert on instruction following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path toward geometry-aware composition of future foundation models.