D3PO: Preference-Based Alignment of Discrete Diffusion Models

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of aligning discrete diffusion models with task-specific preferences when no explicit reward function is available. The authors propose the first Direct Preference Optimization (DPO) method tailored to discrete diffusion models formulated as continuous-time Markov chains (CTMCs). The core contributions are threefold: (i) the first adaptation of DPO to the discrete diffusion framework; (ii) the derivation of a novel preference loss that jointly optimizes preference alignment and fidelity to the reference distribution; and (iii) efficient, controllable fine-tuning without reward modeling. Empirical evaluation on structured binary sequence generation demonstrates substantial improvements in preference alignment while strictly preserving the structural validity of outputs. Compared to reinforcement learning baselines, the approach is simpler, more stable, and more practical to deploy, offering a principled alternative for preference-driven generation in discrete state spaces.
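The page carries no code; as context for the summary above, the standard DPO objective that D3PO adapts can be sketched as follows. This is a minimal NumPy sketch of the generic per-pair DPO loss, not the paper's CTMC-specific derivation; the function and argument names are illustrative.

```python
import numpy as np

def dpo_preference_loss(logp_theta_w, logp_theta_l,
                        logp_ref_w, logp_ref_l, beta=0.1):
    """Generic DPO loss for one (winner, loser) preference pair.

    logp_theta_*: log-likelihoods of the preferred/dispreferred sample
    under the model being fine-tuned; logp_ref_*: the same under the
    frozen reference model. beta scales the implicit KL penalty that
    keeps the fine-tuned model close to the reference distribution.
    """
    margin = beta * ((logp_theta_w - logp_ref_w)
                     - (logp_theta_l - logp_ref_l))
    # -log(sigmoid(margin)), written as softplus(-margin) for
    # numerical stability.
    return np.logaddexp(0.0, -margin)
```

With equal likelihood ratios the loss is log 2; it decreases as the fine-tuned model raises the winner's likelihood ratio over the reference relative to the loser's, which is the joint alignment/fidelity trade-off the summary describes. D3PO's actual loss is derived over CTMC generative paths and differs in detail.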

📝 Abstract
Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
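As background for the CTMC formulation the abstract refers to: in masking-based discrete diffusion, each sequence position independently jumps into an absorbing mask state, so under a unit-rate schedule a position is masked by time t with probability 1 - exp(-t). A minimal sketch, assuming a unit jump rate and an illustrative mask token (neither is specified by the abstract):

```python
import numpy as np

MASK = -1  # illustrative mask token id (assumption, not from the paper)

def mask_noising(x, t, rng):
    """Forward masking CTMC applied to a discrete sequence x.

    Each position independently has jumped to the absorbing MASK
    state by time t with probability 1 - exp(-t) (unit jump rate,
    an illustrative noise schedule).
    """
    p_masked = 1.0 - np.exp(-t)
    jumped = rng.random(x.shape) < p_masked
    return np.where(jumped, MASK, x)
```

At t = 0 the sequence is returned unchanged, and as t grows every position is eventually masked; a uniform-noising schedule, mentioned in the abstract as future work, would instead jump to random tokens rather than an absorbing mask.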
Problem

Research questions and friction points this paper is trying to address.

Aligning discrete diffusion models with task-specific preferences
Fine-tuning generative processes using preference data
Controlled fine-tuning without explicit reward models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts DPO to discrete diffusion models
Uses preference data for fine-tuning
Maintains fidelity to reference distribution
Umberto Borso
ETH Zurich; Centre for Artificial Intelligence, University College London

Davide Paglieri
University College London

Jude Wells
Centre for Artificial Intelligence, University College London

Tim Rocktaschel
Centre for Artificial Intelligence, University College London