🤖 AI Summary
Addressing the challenge of jointly aligning large language models (LLMs) with the tripartite objectives of helpfulness, honesty, and harmlessness (3H), this work introduces the first benchmark dedicated to 3H-aware model merging, uncovering latent cooperation and conflict mechanisms across these dimensions. The authors propose R-TSVM, a training-free model merging framework that integrates outlier-aware parameter weighting and sparsity-adaptive rank selection to mitigate interference from redundant parameters and outliers. Unlike conventional data-mixing paradigms, R-TSVM combines singular value decomposition of task vectors, parameter-level conflict resolution, and heavy-tailed distribution modeling to achieve superior multi-objective trade-offs. Extensive experiments show that R-TSVM consistently outperforms 12 model merging and 3 data mixture baselines across 10 datasets and 5 annotation dimensions, substantially improving holistic 3H alignment. All models are publicly released on Hugging Face.
📝 Abstract
Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI, yet existing methods such as data mixture strategies face limitations, including reliance on expert knowledge and conflicting optimization signals. While model merging offers a promising alternative by integrating specialized models, its potential for 3H optimization remains underexplored. This paper establishes the first comprehensive benchmark for model merging in 3H-aligned LLMs, systematically evaluating 15 methods (12 training-free merging and 3 data mixture techniques) across 10 datasets associated with 5 annotation dimensions, 2 LLM families, and 2 training paradigms. Our analysis reveals three pivotal insights: (i) previously overlooked collaborative and conflicting relationships among the 3H dimensions, (ii) the consistent superiority of model merging over data mixture approaches in balancing alignment trade-offs, and (iii) the critical role of parameter-level conflict resolution through redundant component pruning and outlier mitigation. Building on these findings, we propose R-TSVM, a Reweighting-enhanced Task Singular Vector Merging method that incorporates outlier-aware parameter weighting and sparsity-adaptive rank selection strategies adapted to the heavy-tailed, sparse parameter distributions of LLMs, further improving LLM alignment across multiple evaluations. Our models will be available at https://huggingface.co/Jinluan.
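To make the merging idea concrete, below is a minimal, hypothetical sketch of SVD-based task-vector merging of the kind the abstract describes: each specialized model contributes a task vector (its weight delta from the shared base), which is truncated to its top singular directions before a weighted sum. The function name, the fixed `rank` cutoff, and the uniform default weights are illustrative assumptions; R-TSVM's actual outlier-aware weighting and sparsity-adaptive rank selection are not reproduced here.

```python
import numpy as np

def merge_task_singular_vectors(base, finetuned_list, rank=8, weights=None):
    """Hypothetical sketch: merge one layer's weight matrices by
    low-rank SVD reconstruction of each model's task vector.

    base           -- (m, n) weight matrix of the shared base model
    finetuned_list -- list of (m, n) matrices from specialized models
    rank           -- number of singular directions kept per task vector
                      (a stand-in for the paper's adaptive rank selection)
    weights        -- per-model merge coefficients (uniform by default,
                      standing in for outlier-aware reweighting)
    """
    if weights is None:
        weights = [1.0 / len(finetuned_list)] * len(finetuned_list)
    merged_delta = np.zeros_like(base)
    for w, ft in zip(weights, finetuned_list):
        delta = ft - base  # task vector for one specialized model
        U, S, Vt = np.linalg.svd(delta, full_matrices=False)
        # Keep only the top-`rank` singular directions: pruning the
        # remaining (redundant) components is one form of the
        # parameter-level conflict resolution discussed above.
        low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]
        merged_delta += w * low_rank
    return base + merged_delta
```

With `rank` equal to the full rank and uniform weights, this reduces to simple task-vector averaging; the interesting regime is a small `rank`, where near-zero singular directions (noise and redundancy) are dropped before models are combined.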