MARBLE: Multi-Aspect Reward Balance for Diffusion RL

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of optimizing multi-dimensional rewards when fine-tuning diffusion models with reinforcement learning. Most rollouts are informative for only some reward dimensions, so a conventional weighted sum dilutes their supervisory signal. To overcome this, the authors propose MARBLE, a gradient-space framework for balancing multiple rewards. MARBLE maintains an independent advantage estimator for each reward dimension, computes per-reward policy gradients, and fuses them into a single update direction by solving a quadratic program, which removes the need for manual weight tuning. Combined with an amortized gradient computation and exponential-moving-average (EMA) smoothing of the balancing coefficients, the method approaches single-reward training cost, running at 0.97× the speed of baseline training. On SD3.5 Medium with five rewards, it improves all five reward dimensions simultaneously; the gradient cosine of the worst-aligned reward, which is negative in 80% of mini-batches under weighted summation, becomes consistently positive.
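To make the sample-level mismatch concrete, here is a minimal sketch comparing per-reward advantages with advantages computed from a weighted-sum reward. It assumes group-wise mean/standard-deviation normalization, which the summary does not specify; the function names and the toy reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Center and scale one group of scalar rewards (assumed normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def per_reward_advantages(rewards):
    """rewards[i][k] is reward k of rollout i; output has the same layout."""
    num_rewards = len(rewards[0])
    # Normalize each reward dimension independently across the group.
    columns = [normalize([row[k] for row in rewards]) for k in range(num_rewards)]
    return [[columns[k][i] for k in range(num_rewards)] for i in range(len(rewards))]


def weighted_sum_advantages(rewards, weights):
    """Baseline: collapse rewards to one scalar per rollout, then normalize."""
    scalar = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return normalize(scalar)


# Hypothetical toy group: 4 rollouts, 2 rewards.
# Rollouts 0 and 1 differ only on reward 0; rollouts 2 and 3 only on reward 1.
rewards = [
    [0.9, 0.5],
    [0.1, 0.5],
    [0.5, 0.9],
    [0.5, 0.1],
]
separate = per_reward_advantages(rewards)
combined = weighted_sum_advantages(rewards, [0.5, 0.5])
```

In this toy group, rollouts 0 and 2 receive the same combined advantage (about 1.0) even though they excel on different rewards. The per-reward estimates keep an advantage of about 1.41 on the dimension where each rollout is informative and about 0 on the other, so the scalar signal is both weaker and blind to which reward produced it.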

📝 Abstract

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavily hand-tuned sequential training. We find that the failure stems from naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch: most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others, so weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction by solving a quadratic programming problem, without manually tuned reward weights. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT to reduce the per-step cost from $K+1$ backward passes to near the single-reward baseline cost, together with EMA smoothing of the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative in 80% of mini-batches (under weighted summation) to consistently positive, and runs at 0.97× the training speed of baseline training.
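The abstract does not state the objective or constraints of the quadratic program. As one plausible instance, the sketch below assumes a min-norm formulation over the probability simplex (in the style of multiple-gradient descent), solved with a few Frank-Wolfe steps, followed by EMA smoothing of the coefficients. The function names, the decay value, and the toy gradients are hypothetical, and the sketch uses explicit per-reward gradients rather than the amortized computation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(grads, iters=200):
    """Assumed QP: minimize ||sum_k alpha_k g_k||^2 with alpha on the simplex.

    Solved with Frank-Wolfe and exact line search; convergence is only
    approximate for more than two gradients.
    """
    k = len(grads)
    gram = [[dot(gi, gj) for gj in grads] for gi in grads]
    alpha = [1.0 / k] * k
    for _ in range(iters):
        m_alpha = [dot(row, alpha) for row in gram]
        # Vertex of the simplex that most decreases the objective.
        t = min(range(k), key=lambda i: m_alpha[i])
        d = [(1.0 if i == t else 0.0) - alpha[i] for i in range(k)]
        d_m_d = dot(d, [dot(row, d) for row in gram])
        if d_m_d <= 1e-12:
            break
        # Exact minimizer along the segment, clipped to [0, 1].
        gamma = min(1.0, max(0.0, -dot(d, m_alpha) / d_m_d))
        if gamma == 0.0:
            break
        alpha = [a + gamma * di for a, di in zip(alpha, d)]
    return alpha


def ema_update(prev, new, decay=0.9):
    """Smooth balancing coefficients across steps (decay is an assumed value)."""
    return [decay * p + (1.0 - decay) * n for p, n in zip(prev, new)]


def combine(grads, alpha):
    """Weighted combination of per-reward gradient vectors."""
    return [sum(a * g[j] for a, g in zip(alpha, grads)) for j in range(len(grads[0]))]


# Hypothetical per-reward gradients that conflict (negative inner product).
grads = [[1.0, 0.0], [-3.0, 1.0]]
naive = combine(grads, [0.5, 0.5])      # equal-weight summation
alpha = min_norm_weights(grads)         # balanced coefficients
balanced = combine(grads, alpha)
smoothed = ema_update([0.5, 0.5], alpha)
```

With these toy gradients, the equal-weight direction has a negative inner product with the first gradient, while the min-norm direction has a positive inner product with both. That is the kind of sign change in gradient cosine the abstract reports, though the paper's actual program may differ from this assumption.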

Problem

Research questions and friction points this paper is trying to address.

diffusion models

reinforcement learning

multi-aspect rewards

reward balancing

human preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-aspect reward balancing

Gradient-space optimization

Quadratic programming