LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning methods struggle to directly optimize diffusion-based large language models due to their reliance on intractable exact likelihood computations, which yield high-variance gradient estimates. This work proposes a likelihood-free policy optimization framework that introduces vector field flow matching into discrete token spaces, enabling direct optimization of denoising logits via contrastive learning while incorporating intermediate-state consistency constraints to enhance generation quality. To circumvent approximation errors in likelihood estimation, the method further designs a geometric velocity correction mechanism that enables accurate gradient estimation. Additionally, it leverages probability flow straightening to substantially reduce the number of inference steps. Experiments demonstrate that the proposed approach outperforms state-of-the-art methods on code and reasoning benchmarks while achieving approximately 20% faster inference.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
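The "contrastive updates" on denoising logits described above can be pictured as advantage-weighted cross-entropy steps applied directly to the logits, sidestepping any likelihood estimate. The sketch below is purely illustrative and is not the paper's algorithm: the function names, the scalar `advantage`, and the single-step update rule are all assumptions made for exposition.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax over the vocabulary axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def contrastive_logit_update(logits, tokens, advantage, lr=0.1):
    """Illustrative contrastive step on denoising logits (not the paper's API).

    A positive `advantage` (e.g., a reward-verified completion) pulls
    probability mass toward `tokens`; a negative one pushes it away.
    `logits` has shape (seq_len, vocab); `tokens` has shape (seq_len,).
    """
    probs = softmax(logits)
    onehot = np.zeros_like(logits)
    onehot[np.arange(len(tokens)), tokens] = 1.0
    # Advantage-weighted cross-entropy gradient w.r.t. the logits:
    # (probs - onehot) is the usual CE gradient; the advantage flips its sign.
    grad = advantage * (probs - onehot)
    return logits - lr * grad
```

With `advantage=1.0`, one step increases the probability of each target token; flipping the sign turns the same update into a repulsive one. Operating on logits directly, rather than on an approximate sequence likelihood, is what makes such an update "likelihood-free" in spirit.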
Problem

Research questions and friction points this paper is trying to address.

Likelihood-Free Policy Optimization
Diffusion Models
Reinforcement Learning
dLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Likelihood-Free Policy Optimization
Diffusion Large Language Models
Flow Matching
Contrastive Updates
Probability Flow Straightening
👥 Authors

Chenxing Wei, Shenzhen University (NLP)
Jiazhen Kang, Southeast University
Hong Wang, University of Science and Technology of China
Jianqing Zhang, Shanghai Jiao Tong University (federated learning, synthetic data generation, domain adaptation)
Hao Jiang, University of Science and Technology of China
Xiaolong Xu, 2019~2025 Ant Group / 2025~Now ByteDance (Graph Neural Networks, Knowledge Graph, Federated Learning)
Ningyuan Sun, ByteDance
Ying He, Shenzhen University
F. Richard Yu, Carleton University, FRSC, FCAE, MAE, FIEEE, FEIC (Intell. & Auto. Sys., ML & Embodied AI, IoT, Blockchain)
Yao Shu, Hong Kong University of Science and Technology (Guangzhou)
Bo Jiang, Coding AI Engineer, ByteDance