DRAGON: Distributional Rewards Optimize Diffusion Generative Models

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Media generation models struggle to flexibly adapt to diverse human-perceived quality objectives without extensive human preference annotations. Method: This paper proposes DRAGON, a fine-tuning paradigm for diffusion generative models based on distributional reward optimization. DRAGON uniformly supports three reward types (instance-wise, instance-to-distribution, and distribution-to-distribution) without requiring human preference labels. Its key ingredients are: (1) reward functions built from an encoder and a set of reference examples forming an exemplar distribution, where cross-modal encoders such as CLAP allow the references to come from a different modality (e.g., text references for audio generation); and (2) online, on-policy generation gathering, with scored outputs split into a positive and a negative demonstration set whose contrast is used to maximize the reward. Results: On an audio-domain text-to-music diffusion model evaluated with 20 reward functions (including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance), DRAGON achieves an average win rate of 81.45%. Exemplar-set rewards prove comparable to explicitly modeled rewards; with an appropriate exemplar set, DRAGON reaches a 60.95% human-voted music quality win rate without training on any human preference annotations, unlike RLHF or DPO.
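As a rough illustration of the three reward types the summary names, here is a minimal numpy sketch. All function names, the embedding representation, and the diagonal-covariance simplification of FAD are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def instance_reward(emb: np.ndarray, quality_fn) -> float:
    # Instance-wise reward: score one generation's embedding with any
    # per-example scorer (e.g., an aesthetics model).
    return float(quality_fn(emb))

def instance_to_distribution_reward(emb: np.ndarray, ref_embs: np.ndarray) -> float:
    # Instance-to-distribution reward: cosine similarity between one
    # generation and the mean of a reference exemplar set (CLAP-style).
    ref_mean = ref_embs.mean(axis=0)
    return float(emb @ ref_mean / (np.linalg.norm(emb) * np.linalg.norm(ref_mean)))

def distribution_to_distribution_reward(gen_embs: np.ndarray, ref_embs: np.ndarray) -> float:
    # Distribution-to-distribution reward: negated Fréchet distance
    # between Gaussian fits of the two embedding sets, simplified here
    # to diagonal covariances (true FAD uses full covariance matrices).
    mu_g, var_g = gen_embs.mean(0), gen_embs.var(0)
    mu_r, var_r = ref_embs.mean(0), ref_embs.var(0)
    fad = np.sum((mu_g - mu_r) ** 2) + np.sum(var_g + var_r - 2 * np.sqrt(var_g * var_r))
    return -float(fad)
```

Note the sign convention: distances such as FAD are negated so that every reward type is "higher is better".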

📝 Abstract
We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon.github.io/web.
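The abstract's demonstration-set construction (gather online on-policy generations, score them, split into a positive and a negative set) can be sketched as follows. The half/half split and the function name are illustrative assumptions; the paper's actual partition rule may differ:

```python
import numpy as np

def build_demonstration_sets(generations, scores, frac_positive=0.5):
    """Split a batch of on-policy generations into a positive
    (high-reward) and a negative (low-reward) demonstration set,
    whose contrast is then used to maximize the reward."""
    order = np.argsort(scores)[::-1]                    # best-first indices
    k = max(1, int(len(generations) * frac_positive))   # size of positive set
    positive = [generations[i] for i in order[:k]]
    negative = [generations[i] for i in order[k:]]
    return positive, negative
```

Because the generations are drawn from the current model (on-policy) and scored online, the contrast tracks the model as it improves rather than a fixed offline dataset.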
Problem

Research questions and friction points this paper is trying to address.

Optimizing media generation models with flexible reward functions
Enhancing human-perceived quality without human preference annotations
Cross-modality reward compatibility for diverse generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional rewards flexibly optimize generative models
Cross-modality encoders enable reference examples from other modalities
Contrastive positive/negative sets drive reward maximization
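As a hedged sketch of how the positive/negative contrast can become a training signal, here is a DPO-style pairwise objective on reference-adjusted log-likelihoods. This is illustrative only; DRAGON's actual objective for diffusion models may differ:

```python
import math

def contrastive_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # -log sigmoid(beta * margin): pushes the fine-tuned model to assign
    # higher (reference-model-adjusted) likelihood to the positive
    # generation than to the negative one.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return math.log1p(math.exp(-margin))  # equals -log sigmoid(margin)
```

At zero margin the loss is log 2, and it decays toward zero as the model separates the positive from the negative set.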