ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

📅 2026-04-22

🤖 AI Summary

This work addresses a limitation of existing reinforcement learning post-training methods: they rely on fixed reward weightings and therefore cannot rebalance conflicting objectives at inference time. The authors propose a preference-conditioned multi-objective reinforcement learning framework that feeds continuously adjustable preference weights to a diffusion model as a conditioning input, so that a single model approximates the entire Pareto front. Generation objectives can then be adjusted continuously at inference time without retraining or maintaining multiple checkpoints. Evaluated on three state-of-the-art backbones (SD3.5, FluxKontext, and LTX-2), the method matches or exceeds baselines fine-tuned separately for each fixed reward weighting, removing the single fixed trade-off imposed by conventional scalarization-based approaches.

📝 Abstract

Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals, such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
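
The contrast between a fixed weighted sum and continuously varying preference weights can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how preference weights are sampled or how they enter the reward, so the flat-Dirichlet sampling, the per-sample linear scalarization, and every name below (`sample_preference`, `scalarize`, `FIXED_WEIGHTS`, and so on) are assumptions made for illustration only.

```python
import random

# Hypothetical fixed trade-off chosen once at training time ("early scalarization").
FIXED_WEIGHTS = [0.5, 0.5]


def sample_preference(num_objectives, rng):
    """Draw a weight vector uniformly from the probability simplex.

    Normalizing i.i.d. exponential draws gives a flat Dirichlet sample.
    The sampling distribution is an assumption, not taken from the paper.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(rewards, weights):
    """Collapse a vector of per-objective rewards into one scalar."""
    return sum(w * r for w, r in zip(weights, rewards))


def early_scalarized_reward(rewards):
    """Fixed weighted sum: the model only ever sees one trade-off point."""
    return scalarize(rewards, FIXED_WEIGHTS)


def preference_conditioned_reward(rewards, rng):
    """Sample fresh weights for each training example.

    The returned weights would also be passed to the model as a
    conditioning signal, so it can learn a behavior for each trade-off.
    """
    weights = sample_preference(len(rewards), rng)
    return weights, scalarize(rewards, weights)


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up reward values, e.g. [prompt adherence, source fidelity].
    rewards = [0.9, 0.2]
    print("fixed:", early_scalarized_reward(rewards))
    for _ in range(3):
        weights, reward = preference_conditioned_reward(rewards, rng)
        print("weights:", weights, "reward:", reward)
```

In this sketch the weight vector plays two roles: it scalarizes the reward during training, and it is the value a user would set at inference time to select a point on the learned trade-off curve.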

Problem

Research questions and friction points this paper is trying to address.

multi-objective reinforcement learning

Pareto front

diffusion models

reward control

preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

ParetoSlider

multi-objective reinforcement learning

diffusion models