Pareto-Guided Optimal Transport for Multi-Reward Alignment

📅 2026-05-13
🤖 AI Summary
Existing multi-reward alignment methods for text-to-image generation are vulnerable to reward hacking and struggle to balance conflicting objectives through simple weighted fusion. This work proposes a Pareto Frontier-Guided Optimal Transport (PG-OT) framework that constructs a prompt-specific Pareto frontier and uses distribution-aware optimal transport to map dominated samples toward it, with the aim of robust multi-reward alignment. The approach includes both online and offline optimization strategies and introduces two metrics, the Joint Domination Rate (JDR) and the Joint Collapse Rate (JCR), to quantify multi-reward synergy and reward hacking. Experiments report an 11% gain in JDR over strong baselines and a human-evaluation win rate of nearly 80%.
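To make "prompt-specific Pareto frontier" concrete, here is a minimal sketch of selecting the non-dominated samples among the images generated for one prompt. It is an illustration under stated assumptions, not the paper's implementation: the function name `pareto_front_mask`, the score values, and the convention that higher is better for every reward are all hypothetical.

```python
import numpy as np

def pareto_front_mask(rewards: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking the non-dominated rows.

    rewards has shape (n_samples, n_rewards); higher is assumed better
    for every reward column (an assumption of this sketch).
    """
    n = rewards.shape[0]
    on_front = np.ones(n, dtype=bool)
    for i in range(n):
        # Sample j dominates sample i if j is >= on every reward
        # and strictly > on at least one reward.
        ge_all = np.all(rewards >= rewards[i], axis=1)
        gt_any = np.any(rewards > rewards[i], axis=1)
        if np.any(ge_all & gt_any):
            on_front[i] = False
    return on_front

# Hypothetical scores for five images generated from a single prompt,
# scored by two reward models (one column per reward model).
scores = np.array([
    [0.9, 0.2],
    [0.6, 0.6],
    [0.2, 0.9],
    [0.5, 0.5],  # dominated by [0.6, 0.6]
    [0.1, 0.1],  # dominated by every other sample
])
mask = pareto_front_mask(scores)
front = scores[mask]        # non-dominated samples
dominated = scores[~mask]   # samples that would be moved toward the front
```

In this toy example the first three samples trade off the two rewards against each other and form the frontier, while the last two are dominated.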
📝 Abstract
Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
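The idea of mapping dominated samples toward a frontier with optimal transport can be sketched with entropic-regularized OT solved by Sinkhorn iterations. This is an assumption-laden illustration rather than the PG-OT algorithm: the abstract does not specify the solver, the cost function, or what makes the transport "distribution-aware", so the squared Euclidean cost in reward space, uniform marginals, the `eps` value, and every name below are hypothetical choices.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=500):
    """Entropic-regularized OT plan between marginals a and b.

    Standard Sinkhorn scaling iterations; eps and n_iters are
    illustrative defaults, not values taken from the paper.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]

# Hypothetical reward vectors for one prompt (two reward models).
dominated = np.array([[0.5, 0.5], [0.1, 0.1]])
front = np.array([[0.9, 0.2], [0.6, 0.6], [0.2, 0.9]])

# Uniform mass on each set (an assumption of this sketch).
a = np.full(len(dominated), 1.0 / len(dominated))
b = np.full(len(front), 1.0 / len(front))

# Squared Euclidean cost between every dominated and frontier sample.
diff = dominated[:, None, :] - front[None, :, :]
cost = (diff ** 2).sum(axis=-1)

plan = sinkhorn_plan(cost, a, b)

# Barycentric projection: each dominated sample gets a target that is
# a plan-weighted average of frontier samples.
targets = (plan @ front) / a[:, None]
```

Each target here is a convex combination of frontier points, so it indicates a direction of improvement in reward space but is not itself guaranteed to lie on the frontier. How the paper turns such targets into a training signal for the generator is not described in the abstract.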
Problem

Research questions and friction points this paper is trying to address.

multi-reward alignment
reward hacking
Pareto optimization
text-to-image generation
preference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pareto Frontier
Optimal Transport
Multi-Reward Alignment
Reward Hacking
Text-to-Image Generation