Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing KL divergence–based policy regularization methods in large language model alignment, which only match token probabilities at identical positions and ignore semantic similarity, thereby constraining alignment performance. To overcome this, the authors propose Wasserstein Policy Regularization (WPR), the first approach to integrate entropy-regularized Wasserstein distance into the RLHF framework. By leveraging the geometric structure of the token space, WPR enables semantic-aware policy optimization. Through its dual formulation, the proposed regularizer is efficiently transformed into a computable penalty term on the reward function. Experiments demonstrate that WPR significantly outperforms KL divergence and other f-divergence baselines across multiple alignment tasks, confirming the effectiveness of semantic-aware policy distance in enhancing model alignment.
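The core idea — measuring the distance between policy and reference distributions through a cost on the token space, rather than comparing probabilities index-by-index as KL does — can be illustrated with a minimal Sinkhorn sketch. Everything below (the toy embeddings, the 4-token vocabulary, and the `sinkhorn_distance` helper) is illustrative and assumed, not the authors' implementation:

```python
import numpy as np

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=200):
    """Entropy-regularized Wasserstein distance via Sinkhorn iterations.

    p, q : categorical distributions over the vocabulary (1-D, sum to 1)
    cost : pairwise token cost matrix, e.g. embedding distances
    eps  : entropy-regularization strength
    """
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)              # alternating scaling updates
        u = p / (K @ v)
    transport = u[:, None] * K * v[None, :]   # entropic transport plan
    return float(np.sum(transport * cost))

# Toy 4-token vocabulary with 2-D embeddings; tokens 0 and 1 are near-synonyms.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
cost = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

p = np.array([0.7, 0.1, 0.1, 0.1])    # policy distribution
q = np.array([0.1, 0.7, 0.1, 0.1])    # reference: mass shifted to a near-synonym
r = np.array([0.1, 0.1, 0.7, 0.1])    # reference: mass shifted to a distant token

# The semantically close shift incurs a much smaller distance than the
# distant one, even though both differ from p by the same KL amount.
print(sinkhorn_distance(p, q, cost))
print(sinkhorn_distance(p, r, cost))
```

Note that a KL divergence would assign the same value to both shifts, since it only compares probabilities at matching indices; the Wasserstein distance distinguishes them through the embedding cost.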

📝 Abstract
Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this setting, LLM policies are typically optimized by maximizing reward under Kullback-Leibler (KL) divergence regularization toward the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, yielding a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at https://github.com/aailab-kaist/WPR.
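The abstract states that the dual formulation turns the regularizer into penalty terms on the reward via optimal dual variables. A rough sketch of that idea under standard Sinkhorn duality (where the potentials are recovered from the scaling vectors as $f = \varepsilon \log u$, $g = \varepsilon \log v$) is shown below; the `beta` weight, the toy per-token rewards, and the exact shaping rule are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def sinkhorn_potentials(p, q, cost, eps=0.1, n_iters=300):
    """Optimal dual variables (Kantorovich potentials) of entropic OT."""
    K = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return eps * np.log(u), eps * np.log(v)   # dual potentials f, g

# Same toy setup as the vocabulary geometry: 4 tokens, 2-D embeddings.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
cost = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
policy = np.array([0.7, 0.1, 0.1, 0.1])
ref = np.array([0.1, 0.7, 0.1, 0.1])

f, g = sinkhorn_potentials(policy, ref, cost)

# Hypothetical reward shaping: instead of an explicit OT term in the loss,
# the dual view lets the RL objective see a penalized per-token reward,
# here r(y) - beta * f(y). Tokens whose probability mass the policy has
# moved far (in embedding space) from the reference get penalized more.
beta = 0.5
reward = np.array([1.0, 0.2, 0.2, 0.2])      # hypothetical per-token rewards
shaped = reward - beta * f
```

The practical appeal of this dual view is that the penalty is a per-token additive term, so it plugs into standard policy-gradient pipelines the same way a KL penalty does.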
Problem

Research questions and friction points this paper is trying to address.

large language model alignment
reinforcement learning from human feedback
policy regularization
semantic similarity
KL divergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wasserstein distance
policy regularization
semantic-aware alignment
reinforcement learning from human feedback
large language models