Embedding-perturbed Exploration Preference Optimization for Flow Models

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical limitation of existing group-based preference optimization methods: intra-group sample diversity decays rapidly, which leads to vanishing learning signals, unstable training dynamics, and premature policy convergence or reward hacking. To mitigate this, the authors propose E²PO, a framework that introduces structured exploratory perturbations at the embedding layer. The summary describes it as the first approach to explicitly preserve intra-group variance and prevent diversity collapse in this way. By integrating preference optimization with flow-matching training, E²PO keeps the learning signal discriminative, which enables stable policy updates. The authors report that E²PO significantly outperforms state-of-the-art baselines and aligns more faithfully and robustly with human preferences.
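The vanishing-signal problem can be illustrated with a minimal sketch. It assumes the standard GRPO-style normalization, in which each sample's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_advantages`, the `eps` constant, and the reward values are illustrative assumptions, not taken from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps).

    `eps` is a small assumed constant that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A diverse group: rewards differ, so advantages carry a ranking signal.
diverse = group_advantages([0.2, 0.5, 0.9, 0.4])

# A collapsed group: every sample earns the same reward, so every
# advantage is zero and the policy gradient for this group vanishes.
collapsed = group_advantages([0.7, 0.7, 0.7, 0.7])

print(diverse)
print(collapsed)
```

When every sample in a group receives the same reward, each advantage is zero, so that group contributes no gradient. This is the collapse that embedding-level perturbation is meant to prevent.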

📝 Abstract

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
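As a rough illustration of embedding-level perturbation within a sample group, the sketch below builds a group of conditioning embeddings from a single prompt embedding. The abstract does not specify how the perturbations are structured, so isotropic Gaussian noise is used here purely as a placeholder. The function `perturb_group`, the scale `sigma`, and the choice to keep one unperturbed copy are all hypothetical.

```python
import random

def perturb_group(embedding, group_size, sigma=0.05, seed=0):
    """Return `group_size` variants of one prompt embedding.

    Assumption: isotropic Gaussian offsets stand in for the paper's
    structured perturbation, which is not described in the abstract.
    """
    rng = random.Random(seed)
    group = [list(embedding)]  # first member is left unperturbed (assumed)
    for _ in range(group_size - 1):
        # Each remaining member gets its own independent offset.
        group.append([x + rng.gauss(0.0, sigma) for x in embedding])
    return group

base = [0.1, -0.3, 0.7, 0.0]  # toy 4-dimensional embedding
group = perturb_group(base, group_size=4)

for member in group:
    print([round(x, 3) for x in member])
```

Each group member would then condition its own generation, so samples within a group differ in their conditioning as well as in their initial noise. The intent is to keep the reward spread within the group from shrinking to zero.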

Problem

Research questions and friction points this paper is trying to address.

intra-group variance decay

preference optimization

reinforcement learning

training instability

reward hacking

Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding-perturbed

preference optimization

intra-group variance