🤖 AI Summary
Large language models (LLMs) trained via next-token prediction—a self-supervised objective—lack explicit sequence-level semantic modeling, leading to training-inference misalignment and limiting instruction-following and text generation capabilities. To address this, we propose Contrastive Preference Optimization (CPO), the first method to integrate unsupervised contrastive learning into sequence-level preference modeling. CPO requires no human annotations, reward modeling, or reinforcement learning, and can be applied end-to-end at any training stage to inject sequence-level signals. By leveraging implicit reward estimation, CPO enables self-supervised sequence optimization. Empirical results across diverse instruction-following and text generation benchmarks demonstrate substantial improvements in win rates over standard next-token prediction baselines. CPO establishes an efficient, general-purpose paradigm for sequence-level alignment of LLMs, advancing both alignment fidelity and training scalability.
📝 Abstract
The next-token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results on a variety of downstream tasks. However, on closer investigation of this objective, we find that it lacks an understanding of sequence-level signals, leading to a mismatch between training and inference. To bridge this gap, we introduce a contrastive preference optimization (CPO) procedure that can inject sequence-level information into the language model at any training stage without expensive human-labeled data. Our experiments show that the proposed objective surpasses next-token prediction in terms of win rate on instruction-following and text generation tasks.
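The abstract describes injecting sequence-level preference signals via implicit rewards, with no reward model or human labels. The core idea can be sketched as follows, assuming a DPO-style implicit reward (beta times the policy-vs-reference log-probability gap) as the contrastive score; the paper's exact loss may differ, and all names and numbers below are illustrative:

```python
import math

def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a single sequence-level score."""
    return sum(token_logprobs)

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def cpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """Contrastive preference loss for one (preferred, dispreferred) pair.

    Implicit reward of a sequence = beta * (policy logprob - reference
    logprob); minimizing the loss pushes the policy to rank the preferred
    sequence above the dispreferred one, with no explicit reward model.
    """
    margin = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    return -logsigmoid(margin)

# Toy per-token log-probabilities standing in for model outputs; in a
# self-supervised setting, the contrastive pair could come from the model's
# own samples rather than from human annotators.
pol_pos = sequence_logprob([-0.2, -0.4, -0.1])  # preferred, under policy
pol_neg = sequence_logprob([-1.0, -1.5, -0.8])  # dispreferred, under policy
ref_pos = sequence_logprob([-0.5, -0.6, -0.4])  # preferred, under reference
ref_neg = sequence_logprob([-0.9, -1.2, -0.7])  # dispreferred, under reference

loss = cpo_loss(pol_pos, pol_neg, ref_pos, ref_neg)
print(round(loss, 4))
```

Because the objective operates on whole-sequence scores rather than individual next-token predictions, it supplies exactly the sequence-level signal the abstract says the standard objective lacks, and it can be added at any training stage on top of the usual loss.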