AI Summary
Offline reinforcement learning with a fixed behavior policy suffers from popularity bias and support constraints, which impede exploration of low-density, high-reward regions. This work proposes the Maximum Entropy Adjoint Matching (ME-AM) framework, which introduces entropy regularization and a mixture behavior prior into adjoint matching. By combining continuous flow-matching generative models with mirror descent optimization, ME-AM reduces the policy's dependence on the behavior distribution and expands its support. Experiments show that the method matches or surpasses prior state-of-the-art methods across multiple sparse-reward continuous control tasks.
Abstract
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} that restricts off-manifold exploration. Existing workarounds, such as appending \textit{residual} Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose \textit{Maximum Entropy Adjoint Matching} (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a \textit{Mixture Behavior Prior} that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
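The two failure modes named in the abstract can be illustrated with a toy discrete example. The sketch below is not the paper's continuous flow procedure; it assumes the textbook closed form of a behavior-regularized policy, pi(a) proportional to beta(a)^k * exp(Q(a) / T), over three hypothetical actions. All names, action values, and coefficients (`tilted_policy`, `mix_with_uniform`, `tau`, `alpha`, `epsilon`) are illustrative assumptions. With a KL weight `tau` and an added entropy weight `alpha`, the maximizer of E[Q] - tau * KL(pi || beta) + alpha * H(pi) has k = tau / (tau + alpha) and T = tau + alpha, which flattens the prior. Mixing the prior with a uniform distribution stands in, loosely, for a broader-support behavior prior.

```python
import math


def tilted_policy(prior, q_values, temperature, prior_exponent=1.0):
    """Return pi(a) proportional to prior(a)**prior_exponent * exp(Q(a) / temperature)."""
    weights = [
        (p ** prior_exponent) * math.exp(q / temperature)
        for p, q in zip(prior, q_values)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def mix_with_uniform(prior, epsilon):
    """Blend the prior with a uniform distribution (hypothetical stand-in for a mixture prior)."""
    n = len(prior)
    return [(1.0 - epsilon) * p + epsilon / n for p in prior]


# Hypothetical setup: action 0 is popular, action 1 is rare but better,
# action 2 is the best action and never appears in the dataset.
behavior = [0.9, 0.1, 0.0]
q_values = [1.0, 1.5, 3.0]
tau, alpha, epsilon = 1.0, 1.0, 0.1

# KL-regularized policy only: the popular action dominates and action 2 gets zero mass.
bound = tilted_policy(behavior, q_values, temperature=tau)

# Adding an entropy term flattens the prior: exponent tau / (tau + alpha), temperature tau + alpha.
entropy_reg = tilted_policy(
    behavior, q_values, temperature=tau + alpha, prior_exponent=tau / (tau + alpha)
)

# A prior with broader support lets the unseen high-value action receive probability mass.
mixed = tilted_policy(mix_with_uniform(behavior, epsilon), q_values, temperature=tau)

for name, policy in [("bound", bound), ("entropy_reg", entropy_reg), ("mixed", mixed)]:
    print(name, [round(p, 3) for p in policy])
```

In this toy setting the entropy term shifts mass from the popular action toward the rarer, higher-value one, but it cannot assign mass where the prior is exactly zero; only the broadened prior does that. This mirrors, in a simplified way, why the abstract describes the two mechanisms separately.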