Discrete Flow Matching for Offline-to-Online Reinforcement Learning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of transitioning from offline to online reinforcement learning in discrete action spaces, where existing diffusion- and flow-matching-based generative policies, designed mainly for continuous control, often struggle and can suffer from catastrophic forgetting during online fine-tuning. To overcome this, the authors propose DRIFT, the first approach to integrate discrete flow matching into this setting. DRIFT fine-tunes an offline-pretrained continuous-time Markov chain (CTMC) policy online with an advantage-weighted discrete flow matching loss and adds a path-space regularizer that preserves knowledge of the pretrained trajectory distribution. To scale to large action spaces, it uses a candidate-set approximation that updates the actor over only a small subset of actions. Experiments on benchmarks such as Jericho show that DRIFT achieves the highest average score, outperforming existing baselines, including methods that leverage pretrained language models. Controlled experiments further confirm that the path-space penalty stays bounded during fine-tuning and that the CTMC policy adapts quickly to reward changes.

📝 Abstract

Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline-pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by the missing target probability mass, and that the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines.
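
The candidate-set idea described in the abstract can be illustrated with a small toy example. The following is a minimal sketch under stated assumptions, not the paper's implementation: the target distribution is assumed to be an advantage-weighted reweighting of a reference policy (proportional to ref_probs * exp(advantage / beta)), and all function names, the temperature beta, and the sampling sizes are hypothetical. It builds a candidate set from reference-policy samples plus uniform samples, restricts the target to that set, and reports the missing target probability mass. In this toy construction, the total variation distance between the full target and its renormalized restriction equals the missing mass exactly.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_candidate_set(ref_probs, n_ref, n_uniform, rng):
    """Union of actions sampled from the reference policy and uniformly at random."""
    n_actions = len(ref_probs)
    from_ref = rng.choices(range(n_actions), weights=ref_probs, k=n_ref)
    from_uniform = [rng.randrange(n_actions) for _ in range(n_uniform)]
    return sorted(set(from_ref) | set(from_uniform))


def restrict_to_candidates(target, candidates):
    """Zero the target outside the candidate set and renormalize inside it."""
    mass = sum(target[a] for a in candidates)
    restricted = [0.0] * len(target)
    for a in candidates:
        restricted[a] = target[a] / mass
    return restricted


def missing_mass(target, candidates):
    """Target probability mass that the candidate set fails to cover."""
    return 1.0 - sum(target[a] for a in candidates)


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


rng = random.Random(0)
n_actions = 500

# Hypothetical reference policy and advantages (random, for illustration only).
ref_probs = softmax([rng.gauss(0.0, 3.0) for _ in range(n_actions)])
advantages = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
beta = 1.0  # assumed temperature for the advantage weighting

# Assumed target: reference policy reweighted by exp(advantage / beta).
weights = [p * math.exp(a / beta) for p, a in zip(ref_probs, advantages)]
z = sum(weights)
target = [w / z for w in weights]

for n_ref in (4, 16, 64):
    cands = build_candidate_set(ref_probs, n_ref, n_uniform=4, rng=rng)
    restricted = restrict_to_candidates(target, cands)
    print(
        f"n_ref={n_ref:3d} |C|={len(cands):3d} "
        f"missing_mass={missing_mass(target, cands):.4f} "
        f"tv_error={total_variation(target, restricted):.4f}"
    )
```

Because each candidate set is sampled independently, the missing mass tends to shrink as more reference samples are drawn but is not guaranteed to do so on every run. The exact equality between total variation and missing mass is a property of this restrict-and-renormalize construction and should not be read as the paper's bound on the CTMC generator error.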

Problem

Research questions and friction points this paper is trying to address.

offline-to-online reinforcement learning

discrete action spaces

generative policy

policy fine-tuning

knowledge preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Flow Matching

Offline-to-Online RL

Continuous-Time Markov Chain