Latent Adversarial Regularization for Offline Preference Optimization

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability in current language model preference optimization methods, which rely on token-level regularization and fail to capture semantic or behavioral similarity. To overcome this limitation, the authors propose a latent-space adversarial regularization approach that, for the first time, integrates adversarial training into offline preference optimization. By minimizing the distributional discrepancy between the policy and reference models in the latent space, the method constructs a regularizer that does not require explicit density estimation. This approach effectively mitigates the semantic shortcomings of token-level regularization, significantly enhancing robustness to distributional shift and noisy feedback. It consistently yields performance improvements across diverse model architectures and tasks while introducing only minimal computational overhead.
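To make the core mechanism concrete, here is a minimal PyTorch sketch of GAN-style latent matching under stated assumptions: hidden states are pooled to one vector per sequence, and all names (`LatentDiscriminator`, `latent_reg_loss`) are illustrative, not from the paper. A small discriminator tries to tell policy latents from reference latents; training the policy to fool it pulls the two latent distributions together without estimating any explicit density.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Scores whether a pooled hidden state came from the policy (label 1) or the reference (label 0)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim), e.g. mean-pooled final-layer states
        return self.net(h).squeeze(-1)  # unnormalized logits

def discriminator_loss(disc, h_policy, h_ref):
    # Train the discriminator to separate the two latent distributions.
    # Latents are detached so this step never updates either language model.
    logits_p = disc(h_policy.detach())
    logits_r = disc(h_ref.detach())
    return (F.binary_cross_entropy_with_logits(logits_p, torch.ones_like(logits_p))
            + F.binary_cross_entropy_with_logits(logits_r, torch.zeros_like(logits_r)))

def latent_reg_loss(disc, h_policy):
    # Policy-side term: make policy latents look "reference-like" to the
    # discriminator, i.e. minimize a GAN-style estimate of latent divergence.
    logits_p = disc(h_policy)
    return F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p))
```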

📝 Abstract
Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
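Since the abstract describes GANPO as a regularizer plugged into existing offline preference objectives, the sketch below shows one plausible composition with a DPO-style loss, reusing `discriminator_loss` and `latent_reg_loss` from the sketch above. The coefficient `lam`, the value of `beta`, the batch keys, and the alternating critic/policy update are all assumptions for illustration, not the paper's actual training loop.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pol_w, logp_pol_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Standard DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    margin = (logp_pol_w - logp_ref_w) - (logp_pol_l - logp_ref_l)
    return -F.logsigmoid(beta * margin).mean()

def training_step(disc, disc_opt, batch, lam=0.05):
    # 1) Token-level preference loss, unchanged from vanilla DPO.
    loss = dpo_loss(batch["logp_pol_w"], batch["logp_pol_l"],
                    batch["logp_ref_w"], batch["logp_ref_l"])

    # 2) Critic step: update the discriminator on detached latents.
    d_loss = discriminator_loss(disc, batch["h_policy"], batch["h_ref"])
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # 3) Policy step: add the adversarial latent regularizer; gradients
    #    flow into the policy through batch["h_policy"].
    return loss + lam * latent_reg_loss(disc, batch["h_policy"])
```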
Problem

Research questions and friction points this paper is trying to address.

preference optimization
language models
latent space
offline learning
human feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent-space regularization
preference optimization
adversarial training
offline reinforcement learning
language models
Enyi Jiang
Department of Computer Science, Stanford University; Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign
Yibo Jacky Zhang
Stanford University
Machine Learning
Yinglun Xu
University of Illinois Urbana-Champaign
Machine Learning, Reinforcement Learning
Andreas Haupt
Stanford University
Economics, Artificial Intelligence, Personalisation, Market Design
Nancy Amato
Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign
Sanmi Koyejo
Assistant Professor, Stanford University
Machine Learning, Healthcare AI, Neuroinformatics