Trust Region Q Adjoint Matching

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets a fragility in off-policy fine-tuning of pretrained flow policies. Q-learning with Adjoint Matching (QAM) handles the optimization instability of multi-step sampling by recasting the problem as a memoryless stochastic optimal control (SOC) task with a learned critic, but small critic errors can still be amplified and lead to model collapse. The proposed TRQAM algorithm adds a trust-region mechanism: it tunes the trust-region parameter λ by projected dual descent and shows that the path-space KL divergence from the pretrained policy is a closed-form function of λ. This allows the deviation from the pretrained policy to be controlled precisely, which stabilizes fine-tuning. On 50 OGBench tasks, TRQAM reaches an overall offline RL success rate of 68%, compared with 46% for the strongest baseline.
📝 Abstract
Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in the SOC dynamics and show theoretically that the path-space KL can be represented as a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
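To make the idea of projected dual descent on a trust-region parameter concrete, here is a minimal toy sketch. It is not the TRQAM algorithm and does not use the paper's path-space KL formula, which the abstract does not state. As a stand-in, it assumes a discrete action set with a fixed prior and Q-values, for which the KL-regularized optimum is the tilted policy π(a) ∝ prior(a)·exp(Q(a)/λ) and the KL to the prior can be computed exactly as a function of λ. The names `tilted_policy`, `kl_divergence`, and `projected_dual_descent`, along with the step size and bounds, are hypothetical choices for illustration.

```python
import math


def tilted_policy(prior, q_values, lam):
    """Toy stand-in: pi(a) proportional to prior(a) * exp(Q(a) / lam)."""
    logits = [math.log(p) + q / lam for p, q in zip(prior, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(pi, prior):
    """KL(pi || prior) for discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0.0)


def projected_dual_descent(prior, q_values, epsilon,
                           lam=1.0, step=0.5, lam_min=1e-3, iters=2000):
    """Adjust lam until KL(pi_lam || prior) matches the budget epsilon.

    The dual gradient with respect to lam is (epsilon - KL), so a descent
    step raises lam when the KL exceeds the budget and lowers it otherwise.
    The max(...) is the projection that keeps lam positive.
    """
    for _ in range(iters):
        kl = kl_divergence(tilted_policy(prior, q_values, lam), prior)
        lam = max(lam_min, lam - step * (epsilon - kl))
    return lam


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]   # hypothetical "pretrained" policy
    q_values = [1.0, 0.5, 0.0, -0.5]   # hypothetical critic values
    lam = projected_dual_descent(prior, q_values, epsilon=0.1)
    kl = kl_divergence(tilted_policy(prior, q_values, lam), prior)
    print(f"lambda={lam:.4f}, KL={kl:.4f}")
```

In this toy setting a larger λ keeps the policy closer to the prior, so driving the KL toward a fixed budget gives direct control over how far the improved policy may move. The abstract describes the same kind of control for flow policies, with the KL measured in path space.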
Problem

Research questions and friction points this paper is trying to address.

off-policy reinforcement learning
flow policies
optimization instability
critic error amplification
model collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust Region
Q-Adjoint Matching
Path-space KL
Stochastic Optimal Control
Off-policy Reinforcement Learning