Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging

📅 2025-10-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper identifies an “alignment tax” in post-training alignment of large language models (LLMs): alignment methods (e.g., RLHF, DPO) not only degrade task accuracy but also severely impair output calibration, inducing overconfidence, and reduce response diversity. To navigate the resulting trade-off between alignment and calibration, the authors propose model merging by weight interpolation, which traces a frontier between the supervised fine-tuned (SFT) model and its aligned counterpart (RLHF/DPO) and exposes Pareto-optimal points along it. The method requires no additional training and adds no inference overhead. The interpolated models improve accuracy and calibration together, significantly reducing Expected Calibration Error (ECE) relative to the aligned model, while restoring output diversity. Experiments across multiple benchmarks show that the interpolated models consistently exceed the accuracy of both the aligned models and the base SFT models while recovering much of the calibration lost during alignment, yielding joint gains in capability and reliability.
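For readers unfamiliar with the metric, the sketch below computes ECE using the common equal-width binning definition. The summary does not state the paper's binning scheme, so the function name `expected_calibration_error`, the default of 10 bins, and the right-closed bin edges are assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Assumptions (not taken from the paper): equal-width bins on [0, 1],
    right-closed intervals, and top-1 confidence as the model's probability.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices fall in 0 .. n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

An overconfident model shows up as bins where mean confidence exceeds mean accuracy, which is the failure mode the summary attributes to alignment.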

📝 Abstract

The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident and less reliable and their outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations: models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
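The interpolation described above can be sketched as a per-parameter linear blend, θ(λ) = (1 − λ)·θ_SFT + λ·θ_aligned. This is a minimal illustration, assuming both checkpoints share the same parameter names and shapes; the function name `interpolate_weights`, the use of NumPy arrays in place of real checkpoints, and the restriction of λ to [0, 1] are assumptions, not details confirmed by the abstract.

```python
import numpy as np


def interpolate_weights(sft_state, aligned_state, lam):
    """Blend two parameter dicts: (1 - lam) * SFT + lam * aligned.

    lam = 0 returns the pre-alignment (SFT) weights and lam = 1 returns the
    aligned weights. Hypothetical helper for illustration only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    if sft_state.keys() != aligned_state.keys():
        raise ValueError("checkpoints must have identical parameter names")
    return {
        name: (1.0 - lam) * sft_state[name] + lam * aligned_state[name]
        for name in sft_state
    }


# Toy two-parameter "checkpoints" standing in for real model weights.
sft = {"w": np.array([0.0, 2.0]), "b": np.array([1.0])}
aligned = {"w": np.array([4.0, 2.0]), "b": np.array([3.0])}
merged = interpolate_weights(sft, aligned, lam=0.5)
```

Because the merge happens once, offline, the resulting model has the same size and inference cost as either parent, which is consistent with the claim of computational efficiency.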

Problem

Research questions and friction points this paper is trying to address.

Addresses the alignment-calibration trade-off in post-training models

Mitigates overconfidence and reliability loss through model merging

Improves both accuracy and calibration via Pareto-optimal interpolations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model merging via weight interpolation

Achieves Pareto-optimal balance in alignment trade-off

Improves both model accuracy and calibration simultaneously
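To make "Pareto-optimal" concrete: after evaluating several interpolation coefficients, a merged model is on the frontier if no other candidate has accuracy at least as high and ECE at least as low, with one of the two strictly better. The sketch below filters a sweep accordingly. The function name `pareto_front` and the numbers in the example are placeholders invented for illustration; they are not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (lam, accuracy, ece) tuples, in input order.

    Higher accuracy is better; lower ECE is better.
    """
    front = []
    for i, (lam_i, acc_i, ece_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and ece_j <= ece_i and (acc_j > acc_i or ece_j < ece_i)
            for j, (_, acc_j, ece_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((lam_i, acc_i, ece_i))
    return front


# Placeholder sweep (made-up values, NOT from the paper):
# lam = 0.0 is the SFT model, lam = 1.0 is the aligned model.
sweep = [(0.0, 0.60, 0.05), (0.5, 0.66, 0.08), (1.0, 0.63, 0.15)]
frontier = pareto_front(sweep)
```

In this toy sweep the aligned endpoint is dominated by the midpoint, mirroring the kind of outcome the summary describes, where an interpolated model is both more accurate and better calibrated than the aligned parent.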

🔎 Similar Papers

Pareto Merging: Multi-Objective Optimization for Preference-Aware Model Merging