Entropy-Regularized Adjoint Matching for Offline RL

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Offline reinforcement learning with generative policies is constrained by the fixed behavior distribution: popularity bias can suppress high-reward actions in low-density regions, and support binding restricts exploration beyond the data support. This work proposes Maximum Entropy Adjoint Matching (ME-AM), which adds a mirror descent entropy maximization objective and a Mixture Behavior Prior to adjoint matching within the continuous flow-matching formulation. The entropy objective mitigates popularity bias, while the mixture prior broadens the policy's support to include out-of-distribution high-reward regions. Across a suite of sparse-reward continuous control tasks, ME-AM performs competitively with or better than prior state-of-the-art methods.

📝 Abstract

Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a popularity bias that can suppress high-reward actions in low-density regions, and creates a support binding that restricts off-manifold exploration. Existing workarounds, such as appending residual Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose Maximum Entropy Adjoint Matching (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a Mixture Behavior Prior that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
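The two failure modes named in the abstract can be seen in a toy discrete setting. The sketch below is not the ME-AM algorithm and uses no flow matching; it only assumes the standard closed form for a KL-regularized policy, where pi(a) is proportional to prior(a) * exp(Q(a) / temperature). The function names, the uniform mixture component, the Q-values, and the `epsilon` and `temperature` settings are all hypothetical choices made for illustration.

```python
import math


def kl_regularized_policy(q_values, prior, temperature):
    """Closed-form maximizer of E_pi[Q] - temperature * KL(pi || prior)
    over a finite action set: pi(a) is proportional to prior(a) * exp(Q(a) / temperature)."""
    q_max = max(q_values)  # subtract the max before exponentiating, for numerical stability
    weights = [p * math.exp((q - q_max) / temperature)
               for q, p in zip(q_values, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def mix_with_uniform(prior, epsilon):
    """Hypothetical mixture prior: (1 - epsilon) * prior + epsilon * uniform.
    The uniform component is an assumption for this toy, not the paper's prior."""
    n = len(prior)
    return [(1.0 - epsilon) * p + epsilon / n for p in prior]


# Toy data: four actions. The highest-value action (index 3) never
# appears in the offline dataset, so its behavior probability is zero.
q_values = [0.0, 0.2, 0.1, 1.0]
behavior = [0.7, 0.2, 0.1, 0.0]

# Policy regularized toward the raw behavior distribution.
bound = kl_regularized_policy(q_values, behavior, temperature=0.5)

# Same objective, but regularized toward the broadened mixture prior.
mixed = kl_regularized_policy(
    q_values, mix_with_uniform(behavior, epsilon=0.1), temperature=0.5
)

print("behavior prior:", [round(p, 3) for p in bound])
print("mixture prior: ", [round(p, 3) for p in mixed])
```

With the raw behavior prior, action 3 receives exactly zero probability no matter how large its value is (support binding), and the most frequent action keeps the largest share despite having the lowest value (popularity bias). Mixing in a small uniform component gives action 3 nonzero mass, which is the qualitative effect the abstract attributes to a broadened prior.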

Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning

popularity bias

support binding

expressivity bottleneck

behavior distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum Entropy Adjoint Matching

flow-matching models

offline reinforcement learning