🤖 AI Summary
Standard linear-weighting fusion methods for large language models (LLMs) can leave the training loss unchanged while silently degrading safety alignment, because they ignore how alignment is encoded in parameter geometry. Method: We propose AlignMerge, the first geometric fusion framework that explicitly preserves safety alignment. It models alignment as an invariant on the Fisher–Rao manifold, introduces the decoding-invariant Alignment Quality Index (AQI) as a latent-space alignment criterion, and jointly optimizes Fisher-information-guided local geometric modeling, alignment-subspace projection (P_A), and AQI-driven regularization, with a soft alignment budget constraint that balances safety against capability. Results: Evaluated across five major LLM families, our method fuses safety anchors with task-specific experts and achieves significant gains in AQI, toxicity suppression rate, and LLM-as-a-judge alignment scores, while simultaneously improving instruction following, reasoning, and helpfulness, with no capability trade-off.
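As a rough, hypothetical illustration of what a latent-space alignment criterion of this kind can look like (the function `aqi_score` and its pooled hidden-state inputs are assumptions, not the paper's AQI definition), one can score how cleanly the model's representations of safe and unsafe behavior separate:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def aqi_score(h_safe: np.ndarray, h_unsafe: np.ndarray) -> float:
    """Illustrative AQI-style score: how cleanly hidden states of aligned
    (safe) and misaligned (unsafe) behavior separate in representation space.

    h_safe, h_unsafe: (n_samples, hidden_dim) pooled hidden states collected
    from the model on safe and unsafe prompts, respectively.
    """
    X = np.vstack([h_safe, h_unsafe])
    labels = np.array([0] * len(h_safe) + [1] * len(h_unsafe))
    # Silhouette in [-1, 1]: computed from representations only, never from
    # sampled text, matching the decoding-invariant requirement.
    return float(silhouette_score(X, labels))
```

Because such a score is computed from hidden states rather than generated text, it stays stable under different decoding strategies, which is what makes a criterion of this shape usable as an optimization target during merging.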
📝 Abstract
Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc.
We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize:
L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud,
where L_geo keeps the merge close to its experts in Fisher–Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. For the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space.
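A minimal sketch of how these three terms could be combined, assuming a diagonal Fisher approximation, flattened parameter vectors, and an explicit orthogonal projector for P_A (the names `alignmerge_loss`, `fisher_diag`, and `aqi_fn` are illustrative, not the paper's implementation):

```python
import numpy as np

def alignmerge_loss(theta, theta_base, experts, fisher_diag, P_A,
                    aqi_fn, aqi_budget, lam_align=1.0, lam_bud=1.0):
    """Sketch of L_AlignMerge = L_geo + lam_align * L_align + lam_bud * L_bud.

    theta:       flattened merged weights being optimized
    theta_base:  aligned anchor (instruction-tuned base)
    experts:     list of flattened expert checkpoints
    fisher_diag: diagonal Fisher estimate (local Fisher-Rao metric)
    P_A:         orthogonal projector onto the estimated alignment subspace
    aqi_fn:      alignment functional, e.g. an AQI estimate for theta
    aqi_budget:  minimum alignment score the merge should softly retain
    """
    # L_geo: stay close to each expert under the local Fisher-Rao metric
    # (diagonal approximation of the Fisher information).
    L_geo = sum(np.sum(fisher_diag * (theta - th_e) ** 2) for th_e in experts)

    # L_align: penalize drift from the aligned anchor along
    # alignment-sensitive directions, i.e. the P_A-component of the motion.
    # For an orthogonal projector, drift @ P_A @ drift == ||P_A drift||^2.
    drift = theta - theta_base
    L_align = float(drift @ P_A @ drift)

    # L_bud: soft alignment budget; the hinge activates only when the
    # alignment functional falls below the allowed budget.
    L_bud = max(0.0, aqi_budget - aqi_fn(theta)) ** 2

    return L_geo + lam_align * L_align + lam_bud * L_bud
```

The hinge is what makes the budget soft rather than hard: the merge moves freely as long as the alignment functional stays above the budget, and the penalty grows quadratically once it falls below.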
Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), when merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-as-a-judge alignment) while matching or exceeding the best expert on instruction following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path toward geometry-aware composition of future foundation models.