🤖 AI Summary
This work addresses the optimization instability that multi-step sampling causes in off-policy reinforcement learning of pretrained flow policies. Prior work, Q-learning with Adjoint Matching (QAM), recasts the problem as a memoryless stochastic optimal control (SOC) task with a learned critic, but small critic errors can be amplified and lead to model collapse. The proposed TRQAM algorithm adds a trust-region mechanism: it optimizes the trust-region parameter λ by projected dual descent and shows that the path-space KL divergence from the pretrained policy is a closed-form function of λ. This allows policy deviation to be constrained precisely, giving stable fine-tuning. On 50 OGBench tasks, the method reaches an overall offline RL success rate of 68%, compared with 46% for the strongest baseline.
📝 Abstract
Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating it as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $λ$ in the SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $λ$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline at 46%.
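
The projected dual descent step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the closed-form KL, so `toy_path_kl` is a hypothetical stand-in (a KL that shrinks as $λ$ grows), and the names `eps`, `step`, and `lam_min` are assumptions introduced only for this example.

```python
def toy_path_kl(lam: float, c: float = 1.0) -> float:
    """Hypothetical stand-in for a closed-form KL(lambda).

    NOT the formula from the paper: it only mimics the qualitative
    behaviour that a larger lambda keeps the policy closer to the
    pretrained one (smaller KL).
    """
    return c / lam**2


def projected_dual_descent(kl_fn, eps, lam_init=0.5, step=0.5,
                           lam_min=1e-3, iters=200):
    """Adjust lambda so that kl_fn(lambda) meets the KL budget eps."""
    lam = lam_init
    for _ in range(iters):
        # Gradient of the dual objective w.r.t. lambda: budget minus KL.
        grad = eps - kl_fn(lam)
        # Descent step, then projection onto the feasible set [lam_min, inf).
        lam = max(lam_min, lam - step * grad)
    return lam


# Toy run: find the lambda whose (assumed) KL matches a budget of 0.25.
lam_star = projected_dual_descent(toy_path_kl, eps=0.25)
kl_star = toy_path_kl(lam_star)
print(f"lambda = {lam_star:.4f}, KL = {kl_star:.4f}")
```

When the KL exceeds the budget, the gradient is negative and $λ$ increases, tightening the trust region. When the KL is below the budget, $λ$ decreases and allows a larger deviation. With these toy settings the iterate settles near $λ = 2$, where the stand-in KL equals the budget.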