🤖 AI Summary
Alignment of large language models (LLMs) remains a core challenge in NLP. This paper proposes Alignment-Aware Decoding (AAD), the first method to embed alignment optimization directly into the inference stage, requiring no training or explicit reward modeling beyond the standard Direct Preference Optimization (DPO) setup. AAD guides decoding via DPO's implicit reward signal, dynamically adjusting the output distribution to improve consistency with human preferences. Its key contributions are twofold: (1) shifting alignment from the training phase to the decoding phase, and (2) enabling self-generation of high-quality preference data from model outputs, thereby alleviating data scarcity in low-resource settings. Experiments show that AAD consistently outperforms strong baselines across multiple model scales and mainstream alignment benchmarks, and that it exhibits strong data augmentation and generalization capabilities under data-constrained conditions.
📝 Abstract
Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.
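The abstract's framing of AAD as implicit reward optimization can be illustrated with a toy sketch. Under DPO, the implicit reward is proportional to log(π_dpo / π_ref); one plausible decoding rule (an illustrative assumption, not the paper's exact algorithm) adds β times this per-token reward to the DPO model's log-probabilities and renormalizes, up-weighting tokens the tuned model prefers relative to the reference:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = np.max(logits)
    return logits - (m + np.log(np.sum(np.exp(logits - m))))

def aad_next_token_log_probs(logits_dpo, logits_ref, beta=1.0):
    """Sketch of implicit-reward-guided decoding (assumed form).

    The per-token DPO implicit reward is taken as
    log p_dpo - log p_ref; we add beta times this reward to the
    DPO log-probs and renormalize, boosting tokens the tuned model
    favors relative to the reference.
    """
    log_p_dpo = log_softmax(np.asarray(logits_dpo, dtype=float))
    log_p_ref = log_softmax(np.asarray(logits_ref, dtype=float))
    implicit_reward = log_p_dpo - log_p_ref
    guided = log_p_dpo + beta * implicit_reward
    return log_softmax(guided)

# Toy vocabulary of 3 tokens: token 1 is favored by the reference
# model but not the DPO model, so its implicit reward is negative
# and the guided distribution suppresses it further.
logits_dpo = np.array([2.0, 1.0, 0.5])
logits_ref = np.array([2.0, 2.0, 0.5])
p_dpo = np.exp(log_softmax(logits_dpo))
p_aad = np.exp(aad_next_token_log_probs(logits_dpo, logits_ref, beta=1.0))
```

In this sketch, β plays the same role as the DPO temperature: larger values push the guided distribution further from the DPO model toward tokens with high implicit reward.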