Discrete Tilt Matching

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of a tractable sequence-level likelihood in masked diffusion large language models (dLLMs), which hinders reinforcement learning (RL) fine-tuning because existing RL objectives typically depend on that likelihood. The authors propose Discrete Tilt Matching (DTM), a likelihood-free method that recasts fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with an explicit minimizer and admits control variates that improve training stability. On a synthetic maze-planning task, the authors analyze how the annealing schedule and control variates affect training stability and prevent mode collapse. Fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.

📝 Abstract
Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.
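
To make "reward tilting" and "weighted cross-entropy with explicit minimizer" concrete, here is a minimal toy sketch in Python. It is not the DTM objective from the paper: it works on a single categorical distribution rather than on local unmasking posteriors of a dLLM, it omits control variates, and the names `tilted_distribution`, `weighted_cross_entropy`, and `strength` are hypothetical. It only illustrates the general fact that a cross-entropy weighted by a reward-tilted target is minimized by that tilted target, and that a tilt strength can be increased gradually, which is one plausible reading of an annealing schedule.

```python
import math


def tilted_distribution(ref_probs, rewards, strength):
    """Return p(x) proportional to ref_probs[x] * exp(strength * rewards[x]).

    `strength` is a hypothetical tilt parameter: 0 recovers the reference
    distribution, larger values move more mass toward high-reward outcomes.
    """
    logits = [math.log(p) + strength * r for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_cross_entropy(target_probs, model_probs):
    """Cross-entropy of model_probs against target_probs.

    Over all valid model distributions, this is minimized when
    model_probs equals target_probs (Gibbs' inequality).
    """
    return -sum(t * math.log(q) for t, q in zip(target_probs, model_probs))


if __name__ == "__main__":
    # Toy assumption: three outcomes with a fixed reference model and rewards.
    ref = [0.5, 0.3, 0.2]
    rewards = [0.0, 1.0, 2.0]

    # Increase the tilt strength step by step (an assumed, illustrative schedule).
    for strength in (0.0, 0.5, 1.0):
        target = tilted_distribution(ref, rewards, strength)
        loss_at_ref = weighted_cross_entropy(target, ref)
        loss_at_target = weighted_cross_entropy(target, target)
        print(strength, [round(p, 3) for p in target],
              round(loss_at_ref, 4), round(loss_at_target, 4))
```

In this toy setting the loss evaluated at the tilted target is never larger than the loss evaluated at the reference model, which is the sense in which the minimizer is explicit. How DTM defines its weights, schedule, and control variates at the level of masked states is specified in the paper itself, not here.
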
Problem

Research questions and friction points this paper is trying to address.

discrete diffusion
masked diffusion models
reinforcement learning
likelihood-free
fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Tilt Matching
masked diffusion LLMs
likelihood-free optimization
reward tilting
control variates