Embedding-perturbed Exploration Preference Optimization for Flow Models

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of group-based preference optimization methods such as GRPO: intra-group sample diversity decays rapidly during training, so the learning signal vanishes, training becomes unstable, and the policy either stagnates prematurely or resorts to reward hacking. The authors propose E²PO, a framework for flow models that applies structured exploratory perturbations at the embedding level to preserve intra-group variance and prevent diversity collapse. Keeping that variance non-zero keeps the learning signal discriminative throughout training, which supports stable policy updates. The authors report that E²PO significantly outperforms state-of-the-art baselines and aligns more faithfully with human preferences.
📝 Abstract
Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
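The variance-decay argument in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. `group_advantages` assumes the commonly used GRPO-style normalization (reward minus group mean, divided by group standard deviation). `perturb_embeddings` is a hypothetical stand-in that adds plain isotropic Gaussian noise to a conditioning embedding, whereas the paper describes "structured" perturbations whose exact form is not given here. All function names, `sigma`, and `eps` are assumptions made for illustration.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages (assumed form):
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def perturb_embeddings(embedding, group_size, sigma, rng):
    """Hypothetical embedding-level perturbation: one noisy copy of the
    conditioning embedding per group member. Isotropic Gaussian noise is
    an assumption, not the structured scheme the paper proposes."""
    noise = rng.normal(0.0, sigma, size=(group_size,) + embedding.shape)
    return embedding[None, ...] + noise


if __name__ == "__main__":
    # Collapsed group: identical rewards give all-zero advantages,
    # so the policy update carries no learning signal.
    print(group_advantages([0.5, 0.5, 0.5, 0.5]))

    # Diverse group: distinct rewards give non-zero, zero-mean advantages.
    print(group_advantages([0.2, 0.4, 0.6, 0.8]))

    # Perturbing the embedding gives each group member different conditioning,
    # which is the mechanism the abstract credits with sustaining variance.
    rng = np.random.default_rng(0)
    group = perturb_embeddings(np.zeros(8), group_size=4, sigma=0.1, rng=rng)
    print(group.shape, group.var(axis=0).sum())
```

The sketch only shows why zero intra-group reward variance removes the gradient signal and where an embedding-level perturbation would enter the sampling step. How the perturbation is structured and scheduled is left to the paper.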
Problem

Research questions and friction points this paper is trying to address.

intra-group variance decay
preference optimization
reinforcement learning
training instability
reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding-perturbed
preference optimization
intra-group variance
reinforcement learning
flow models