Discrete Flow Matching Policy Optimization

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of policy collapse in reinforcement learning fine-tuning of discrete flow matching (DFM) models, which arises from biased estimators and likelihood surrogates that compromise the balance between functionality and naturalness of generated sequences. The authors propose the first formulation of the DFM sampling process as a multi-step Markov decision process, enabling an unbiased policy gradient optimization framework. To preserve the pre-trained distribution’s characteristics and mitigate policy collapse, they introduce a total variation–based regularization term and provide a theoretical upper bound on the approximation error. Evaluated on DNA enhancer design, the method substantially outperforms existing reward-driven baselines, achieving higher predicted activity while maintaining both sequence naturalness and functional performance.
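For categorical per-step transitions, the total variation penalty described above has a simple closed form. The sketch below is a minimal illustration in PyTorch; the function name, tensor shapes, and averaging convention are our assumptions rather than the paper's (the paper establishes tractable upper bounds on the regularizers, not necessarily this exact computation):

```python
import torch
import torch.nn.functional as F

def tv_regularizer(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Total variation between fine-tuned and pretrained per-position
    transition distributions, TV(p, q) = 0.5 * sum_v |p_v - q_v|,
    averaged over batch and sequence positions.

    Both inputs are (batch, length, vocab) logits; `ref_logits` should come
    from the frozen pretrained model.
    """
    p = F.softmax(policy_logits, dim=-1)  # fine-tuned transition probs
    q = F.softmax(ref_logits, dim=-1)     # frozen pretrained transition probs
    return 0.5 * (p - q).abs().sum(dim=-1).mean()
```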
📝 Abstract
We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
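To make the "sampling as a multi-step MDP" view concrete, here is a minimal REINFORCE-style sketch in PyTorch. Everything in it (the function names, the per-position categorical transition parameterization, crediting a single terminal reward to every step) is an illustrative assumption consistent with the abstract, not DoMinO's actual algorithm; in particular it ignores DFM-specific transition structure such as which positions are allowed to change at each step:

```python
import torch

def dfm_mdp_rollout(policy, x_T: torch.Tensor, n_steps: int):
    """Sample a sequence by treating each denoising step as one MDP
    transition: the state is the current sequence, the action is the next
    sequence, and `policy(x, t)` returns (batch, length, vocab) logits."""
    x, log_probs = x_T, []
    for step in range(n_steps):
        t = torch.full((x.shape[0],), 1.0 - step / n_steps)  # time at this step
        dist = torch.distributions.Categorical(logits=policy(x, t))
        x = dist.sample()                                    # next state, (batch, length)
        log_probs.append(dist.log_prob(x).sum(dim=-1))       # per-sequence log-prob, (batch,)
    return x, torch.stack(log_probs)                         # sample and (n_steps, batch) log-probs

def reinforce_loss(log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """Score-function (REINFORCE) surrogate: a terminal reward on the
    generated sequence is credited to every transition's log-prob."""
    return -(log_probs * reward.detach()).sum(dim=0).mean()
```

Under these assumptions, a regularized fine-tuning objective would combine the two pieces, e.g. `reinforce_loss(log_probs, reward) + beta * tv_regularizer(policy_logits, ref_logits)` with a tunable weight `beta`.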
Problem

Research questions and friction points this paper is trying to address.

Discrete Flow Matching
Reinforcement Learning
Policy Optimization
Sequence Generation
Regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Flow Matching
Policy Optimization
Reinforcement Learning
Total Variation Regularization
Sequence Generation
Maojiang Su
Center for Foundation Models and Generative AI, Northwestern University, Evanston, IL 60208, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Po-Chung Hsieh
Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Weimin Wu
Ph.D. Candidate in Computer Science, Northwestern University
AI for Biology, ML Theory
Mingcheng Lu
Jiunhau Chen
Department of Physics, National Taiwan University, Taipei 10617, Taiwan
Jerry Yao-Chieh Hu
Northwestern University
Machine Learning (* denotes equal contribution)
Han Liu
Orrington Lunt Professor of Computer Science, Statistics and Data Science, Northwestern University
Machine Learning, Large Foundation Models for AI, AI for Science and Finance