🤖 AI Summary
This paper examines the “alignment tax” of post-training alignment in large language models (LLMs): alignment methods (e.g., RLHF, DPO) not only degrade task accuracy but also severely impair output calibration, inducing overconfidence, and reduce response diversity. To navigate the trade-off between accuracy and calibration, the authors interpolate between the weights of the supervised fine-tuned (SFT) model and its aligned counterpart (RLHF/DPO), tracing a Pareto frontier between the two. The method requires no additional training and adds no inference overhead, since the result is a single merged model. The interpolated models improve accuracy beyond both parents while substantially reducing Expected Calibration Error (ECE) relative to the aligned model and restoring output diversity. Experiments across multiple benchmarks show these gains consistently, indicating that capability and reliability can improve together.
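For readers unfamiliar with the calibration metric named above, the following is a minimal sketch of a binned Expected Calibration Error, not the paper's evaluation code. The function name `expected_calibration_error`, the choice of 10 equal-width bins, and the half-open binning convention are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumption: 10 equal-width bins
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, y in zip(confidences, correct):
        # Assumed convention: bins are [k/n_bins, (k+1)/n_bins),
        # with a confidence of exactly 1.0 placed in the last bin.
        k = min(int(c * n_bins), n_bins - 1)
        conf_sum[k] += c
        hit_sum[k] += float(y)
        count[k] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[k] / count[k]
        accuracy = hit_sum[k] / count[k]
        ece += (count[k] / n) * abs(accuracy - avg_conf)
    return ece
```

An overconfident model, one that reports high confidence on answers that are often wrong, produces a large gap between per-bin confidence and accuracy and therefore a high ECE.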
📝 Abstract
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident and less reliable and their outputs less diverse. This trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations: models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
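The interpolation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: it assumes plain linear interpolation, θ(λ) = (1 − λ)·θ_SFT + λ·θ_aligned, and represents each model as a dictionary mapping parameter names to flat lists of floats. The names `interpolate_weights`, `Weights`, and the coefficient `lam` are hypothetical.

```python
from typing import Dict, List

# Assumption: a model is a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(sft: Weights, aligned: Weights, lam: float) -> Weights:
    """Linear interpolation: lam = 0 returns the SFT weights, lam = 1 the aligned ones."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if sft.keys() != aligned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, w_sft in sft.items():
        w_aligned = aligned[name]
        if len(w_sft) != len(w_aligned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parents.
        merged[name] = [(1.0 - lam) * a + lam * b for a, b in zip(w_sft, w_aligned)]
    return merged


# Toy usage with two hypothetical one-parameter "models".
sft_model = {"w": [0.0, 2.0]}
aligned_model = {"w": [1.0, 4.0]}
midpoint = interpolate_weights(sft_model, aligned_model, lam=0.5)
```

Because the merged model has the same architecture and parameter count as its parents, it costs nothing extra at inference time. In practice one would sweep `lam` over a grid, evaluate accuracy and calibration on held-out data for each merged model, and keep the settings that no other setting beats on both metrics.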