🤖 AI Summary
Most existing preference optimization methods model preferences at the sequence level, which does not capture the token-by-token decisions made during generation and can limit alignment quality and training stability. This work proposes Token-level Bregman Preference Optimization (TBPO), which uses only standard sequence-level pairwise preferences to build a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix. Using Bregman-divergence density-ratio matching, TBPO derives a simple objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model. It requires no additional annotations and comes in two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Experiments on instruction following, helpfulness/harmlessness, and summarization benchmarks show that TBPO improves alignment quality, training stability, and output diversity relative to strong sequence-level and token-level baselines.
📝 Abstract
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
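For context, the sequence-level logistic/DPO loss that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard DPO baseline for a single preference pair, not the TBPO objective, which is not spelled out here. The function names, the list-of-floats inputs, and the default `beta=0.1` are assumptions made for the example. The point it shows is that per-token log-probabilities are summed into one sequence-level margin before the logistic loss is applied, so individual token decisions are never compared directly.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sequence_logprob(token_logprobs):
    """Collapse per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities of a response
    under the policy or the frozen reference model (hypothetical inputs).
    """
    # Log-ratio of policy to reference for each full response.
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    # One scalar margin per pair: token-level structure is lost at this step.
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's probability of the preferred response relative to the reference lowers it.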