🤖 AI Summary
This work addresses a core challenge in maximum entropy reinforcement learning: the optimal policy corresponds to an intractable energy-based distribution, and existing methods suffer from discretization bias when estimating log-likelihoods, which hinders effective exploration-exploitation trade-offs. To overcome this, we propose FLAME, a novel framework that bypasses partition function estimation via a Q-reweighted flow matching objective, introduces an unbiased decoupled entropy estimator to correct estimation bias, and, building upon MeanFlow, integrates single-step flow matching into maximum entropy RL for the first time. Evaluated on MuJoCo benchmarks, FLAME matches the performance of multi-step diffusion policies, significantly outperforms Gaussian baselines, and substantially reduces inference overhead, striking a favorable balance between representational capacity and computational efficiency.
📝 Abstract
Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose **F**low-based **L**og-likelihood-**A**ware **M**aximum **E**ntropy RL (**FLAME**), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost. Code is available at https://github.com/lzqw/FLAME.
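To make the first ingredient concrete, here is a minimal NumPy sketch of what a Q-reweighted flow matching loss could look like. All names (`q_value`, `q_reweighted_fm_loss`, `v_theta`, `alpha`) are illustrative assumptions, not the paper's actual implementation: the key idea shown is that self-normalized softmax weights over Q-values act as importance weights, so the energy-based policy's partition function cancels and never needs to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, actions):
    # Stand-in critic; in practice this would be a learned Q-network.
    return -np.sum((actions - state) ** 2, axis=-1)

def q_reweighted_fm_loss(v_theta, state, actions, alpha=0.2):
    """Sketch of a Q-reweighted conditional flow matching loss.

    Each sampled action's regression term is weighted by
    exp(Q/alpha) / sum_j exp(Q_j/alpha); the partition function of the
    target energy-based policy cancels in this self-normalized ratio.
    """
    n, d = actions.shape
    # Conditional FM: linearly interpolate between noise x0 and action x1.
    x0 = rng.standard_normal((n, d))
    t = rng.uniform(size=(n, 1))
    x_t = (1.0 - t) * x0 + t * actions
    target = actions - x0  # straight-line velocity target of the linear path
    # Self-normalized importance weights (max-subtracted for stability).
    q = q_value(state, actions) / alpha
    w = np.exp(q - q.max())
    w = w / w.sum()
    per_sample = np.sum((v_theta(x_t, t) - target) ** 2, axis=-1)
    return float(np.sum(w * per_sample))
```

A usage sketch: with a batch of candidate actions for one state, the loss pulls the velocity field toward actions with high Q-value, since those dominate the softmax weights.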