🤖 AI Summary
Large language models (LLMs) trained via next-token prediction—a self-supervised objective—lack explicit sequence-level semantic modeling, leading to training-inference misalignment and limiting instruction-following and text generation capabilities. To address this, we propose Contrastive Preference Optimization (CPO), the first method to integrate unsupervised contrastive learning into sequence-level preference modeling. CPO requires no human annotations, reward modeling, or reinforcement learning, and can be applied end-to-end at any training stage to inject sequence-level signals. By leveraging implicit reward estimation, CPO enables self-supervised sequence optimization. Empirical results across diverse instruction-following and text generation benchmarks demonstrate substantial improvements in win rates over standard next-token prediction baselines. CPO establishes an efficient, general-purpose paradigm for sequence-level alignment of LLMs, advancing both alignment fidelity and training scalability.
📝 Abstract
The next-token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results on a variety of downstream tasks. However, on closer investigation of this objective, we find that it lacks an understanding of sequence-level signals, leading to a mismatch between training and inference. To bridge this gap, we introduce a contrastive preference optimization (CPO) procedure that can inject sequence-level information into the language model at any training stage without expensive human-labeled data. Our experiments show that the proposed objective surpasses next-token prediction in terms of win rate on instruction-following and text generation tasks.
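The abstract describes injecting sequence-level preference signals via implicit rewards, with no reward model or human labels. The core idea can be sketched as follows, assuming a DPO-style implicit reward (beta times the policy-vs-reference log-probability gap) as the contrastive score; the paper's exact loss may differ, and all names and numbers below are illustrative:

```python
import math

def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a single sequence-level score."""
    return sum(token_logprobs)

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def cpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """Contrastive preference loss for one (preferred, dispreferred) pair.

    Implicit reward of a sequence = beta * (policy logprob - reference
    logprob); minimizing the loss pushes the policy to rank the preferred
    sequence above the dispreferred one, with no explicit reward model.
    """
    margin = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    return -logsigmoid(margin)

# Toy per-token log-probabilities standing in for model outputs; in a
# self-supervised setting, the contrastive pair could come from the model's
# own samples rather than from human annotators.
pol_pos = sequence_logprob([-0.2, -0.4, -0.1])  # preferred, under policy
pol_neg = sequence_logprob([-1.0, -1.5, -0.8])  # dispreferred, under policy
ref_pos = sequence_logprob([-0.5, -0.6, -0.4])  # preferred, under reference
ref_neg = sequence_logprob([-0.9, -1.2, -0.7])  # dispreferred, under reference

loss = cpo_loss(pol_pos, pol_neg, ref_pos, ref_neg)
print(round(loss, 4))
```

Because the objective operates on whole-sequence scores rather than individual next-token predictions, it supplies exactly the sequence-level signal the abstract says the standard objective lacks, and it can be added at any training stage on top of the usual loss.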