Discrete Flow Matching Policy Optimization

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of policy collapse in reinforcement learning fine-tuning of discrete flow matching (DFM) models, which arises from biased estimators and likelihood surrogates that compromise the balance between functionality and naturalness of generated sequences. The authors propose the first formulation of the DFM sampling process as a multi-step Markov decision process, enabling an unbiased policy gradient optimization framework. To preserve the pre-trained distribution’s characteristics and mitigate policy collapse, they introduce a total variation–based regularization term and provide a theoretical upper bound on the approximation error. Evaluated on DNA enhancer design, the method substantially outperforms existing reward-driven baselines, achieving higher predicted activity while maintaining both sequence naturalness and functional performance.
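For categorical per-step transitions, the total variation penalty described above has a simple closed form. The sketch below is a minimal illustration in PyTorch; the function name, tensor shapes, and averaging convention are our assumptions rather than the paper's (the paper establishes tractable upper bounds on the regularizers, not necessarily this exact computation):

```python
import torch
import torch.nn.functional as F

def tv_regularizer(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Total variation between fine-tuned and pretrained per-position
    transition distributions, TV(p, q) = 0.5 * sum_v |p_v - q_v|,
    averaged over batch and sequence positions.

    Both inputs are (batch, length, vocab) logits; `ref_logits` should come
    from the frozen pretrained model.
    """
    p = F.softmax(policy_logits, dim=-1)  # fine-tuned transition probs
    q = F.softmax(ref_logits, dim=-1)     # frozen pretrained transition probs
    return 0.5 * (p - q).abs().sum(dim=-1).mean()
```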
📝 Abstract
We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
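To make the "sampling as a multi-step MDP" view concrete, here is a minimal REINFORCE-style sketch in PyTorch. Everything in it (the function names, the per-position categorical transition parameterization, crediting a single terminal reward to every step) is an illustrative assumption consistent with the abstract, not DoMinO's actual algorithm; in particular it ignores DFM-specific transition structure such as which positions are allowed to change at each step:

```python
import torch

def dfm_mdp_rollout(policy, x_T: torch.Tensor, n_steps: int):
    """Sample a sequence by treating each denoising step as one MDP
    transition: the state is the current sequence, the action is the next
    sequence, and `policy(x, t)` returns (batch, length, vocab) logits."""
    x, log_probs = x_T, []
    for step in range(n_steps):
        t = torch.full((x.shape[0],), 1.0 - step / n_steps)  # time at this step
        dist = torch.distributions.Categorical(logits=policy(x, t))
        x = dist.sample()                                    # next state, (batch, length)
        log_probs.append(dist.log_prob(x).sum(dim=-1))       # per-sequence log-prob, (batch,)
    return x, torch.stack(log_probs)                         # sample and (n_steps, batch) log-probs

def reinforce_loss(log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """Score-function (REINFORCE) surrogate: a terminal reward on the
    generated sequence is credited to every transition's log-prob."""
    return -(log_probs * reward.detach()).sum(dim=0).mean()
```

Under these assumptions, a regularized fine-tuning objective would combine the two pieces, e.g. `reinforce_loss(log_probs, reward) + beta * tv_regularizer(policy_logits, ref_logits)` with a tunable weight `beta`.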
Problem

Research questions and friction points this paper is trying to address.

Discrete Flow Matching
Reinforcement Learning
Policy Optimization
Sequence Generation
Regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Flow Matching
Policy Optimization
Reinforcement Learning
Total Variation Regularization
Sequence Generation
Maojiang Su
Center for Foundation Models and Generative AI, Northwestern University, Evanston, IL 60208, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Po-Chung Hsieh
Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Weimin Wu
Ph.D. Candidate in Computer Science, Northwestern University
AI for Biology, ML Theory
Mingcheng Lu
Jiunhau Chen
Department of Physics, National Taiwan University, Taipei 10617, Taiwan
Jerry Yao-Chieh Hu
Northwestern University
Machine Learning (* denotes equal contribution)
Han Liu
Orrington Lunt Professor of Computer Science, Statistics and Data Science, Northwestern University
Machine Learning, Large Foundation Models for AI, AI for Science and Finance