SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

πŸ“… 2026-02-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the performance collapse commonly observed when offline reinforcement learning policies are fine-tuned online, which often stems from low-reward β€œvalleys” in the loss landscape. To mitigate this issue, the authors propose SMAC, a method that regularizes the Q-function during the offline phase to enforce a first-order derivative equivalence between the policy score and the action gradient of the Q-function. This constraint yields a smooth, high-reward path connecting the offline and online optima. SMAC is the first approach to leverage this first-order matching condition between the policy score and the Q-function's action gradient to enable stable transfer. Empirical results on all six D4RL benchmark tasks demonstrate that SMAC achieves smooth transitions without performance degradation, and in four environments it reduces regret by 34%–58% compared to the best baseline.

πŸ“ Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34%–58% over the best baseline.
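The first-order condition the abstract describes can be illustrated with a toy example. For a Boltzmann-consistent policy–critic pair (π ∝ exp(Q/α)), the policy score ∇ₐ log π(a|s) should equal ∇ₐQ(s,a)/α, and a score-matching regularizer penalizes their squared mismatch. The sketch below is an illustrative assumption, not the paper's implementation: it uses a toy quadratic critic, a Gaussian policy (whose score is analytic), and a finite-difference action gradient.

```python
import numpy as np

def q_value(a, a_star=0.5, scale=2.0):
    # toy critic: quadratic in the action, peaked at a_star (illustrative only)
    return -scale * (a - a_star) ** 2

def q_action_grad(a, eps=1e-5):
    # finite-difference action-gradient of the critic, grad_a Q(s, a)
    return (q_value(a + eps) - q_value(a - eps)) / (2 * eps)

def policy_score(a, mu, sigma):
    # score of a Gaussian policy: grad_a log N(a; mu, sigma^2)
    return -(a - mu) / sigma ** 2

def score_match_penalty(actions, mu, sigma, alpha):
    # squared mismatch between the critic's action-gradient (scaled by 1/alpha)
    # and the policy score, averaged over sampled actions
    diff = q_action_grad(actions) / alpha - policy_score(actions, mu, sigma)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
alpha, scale = 1.0, 2.0
# for this toy pair the match is exact when mu = a_star and sigma^2 = alpha / (2 * scale)
mu, sigma = 0.5, np.sqrt(alpha / (2 * scale))
actions = rng.normal(mu, sigma, size=256)
print(score_match_penalty(actions, mu, sigma, alpha))        # near zero when matched
print(score_match_penalty(actions, 0.0, 1.0, alpha))         # large when mismatched
```

In an actual actor-critic this penalty would be added to the offline critic loss, with ∇ₐQ computed by automatic differentiation rather than finite differences.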
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
online fine-tuning
performance drop
offline-to-online transfer
value-based RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline reinforcement learning
offline-to-online transfer
score matching
actor-critic
Q-function regularization
Nathan S. de Lara
Department of Computer Science, University of Toronto, Toronto, Canada; Vector Institute
Florian Shkurti
Assistant Professor, Computer Science, University of Toronto
Robotics · Machine Learning · Computer Vision · Artificial Intelligence