AI Summary
This work addresses a limitation of existing reinforcement learning post-training methods: they rely on fixed reward weightings and therefore cannot flexibly balance conflicting objectives at inference time. The authors propose a preference-conditioned multi-objective reinforcement learning framework that feeds continuously adjustable preference weights to a diffusion model as a conditioning input, enabling a single model to approximate the entire Pareto front. For the first time, this approach allows continuous adjustment of generation objectives at inference time without retraining or maintaining multiple checkpoints. Evaluated on three state-of-the-art diffusion backbones (SD3.5, FluxKontext, and LTX-2), the method matches or even surpasses baselines fine-tuned separately for fixed reward weightings, substantially overcoming the constraints of conventional scalarization-based approaches.
Abstract
Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals, such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
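The contrast between early scalarization and preference conditioning can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how preference weights are sampled or how they enter the reward, so the uniform simplex sampling, the linear weighted sum, the helper names `scalarize` and `sample_preference`, and the reward values are all assumptions made for illustration.

```python
import random


def scalarize(rewards, weights):
    """Collapse per-objective rewards into one scalar via a weighted sum."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def sample_preference(num_objectives, rng):
    """Draw non-negative weights that sum to 1.

    Normalizing i.i.d. exponential draws yields a uniform sample from the
    probability simplex. This sampling scheme is an assumption, not taken
    from the paper.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical rewards for one generated sample:
# [prompt adherence, source fidelity]
rewards = [0.8, 0.3]

# Early scalarization: the trade-off is fixed once, before training.
fixed_weights = [0.5, 0.5]
fixed_reward = scalarize(rewards, fixed_weights)

# Preference conditioning (sketch): a fresh weight vector is drawn per
# training sample. In a real setup the same vector would also be passed to
# the model as a conditioning input, so it can learn a different behavior
# for each trade-off.
rng = random.Random(0)
preference = sample_preference(len(rewards), rng)
conditioned_reward = scalarize(rewards, preference)
```

Under these assumptions, a user at inference time would supply the preference vector directly, for example `[0.9, 0.1]` to favor prompt adherence, instead of switching between separately trained checkpoints.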