🤖 AI Summary
Standard successor feature learning under nonlinear function approximation relies on semi-gradient methods that lack convergence guarantees, which can introduce substantial estimation bias and limit transfer performance in multitask reinforcement learning. This work proposes Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), which integrates, for the first time, the full-gradient temporal-difference approach into successor feature learning. By minimizing the complete Bellman error, taking gradients with respect to both the online and target network parameters, FG-SFRQL achieves almost-sure convergence. The method improves sample efficiency and cross-task transfer, consistently outperforming existing semi-gradient baselines on both discrete and continuous control tasks.
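The core distinction the summary describes, differentiating through both the online estimate and the bootstrap target, can be illustrated with a minimal tabular sketch. This is not the paper's implementation (which uses deep networks); the chain MDP, one-hot features, and step sizes below are illustrative assumptions.

```python
import numpy as np

# Tabular full-gradient SF update on a 3-state chain 0 -> 1 -> 2,
# with state 2 absorbing. phi(s) are one-hot state features, and
# Psi[s] estimates the successor features E[sum_t gamma^t phi(s_t)].
gamma, lr = 0.9, 0.2
phi = np.eye(3)
Psi = np.zeros((3, 3))
transitions = [(0, 1), (1, 2), (2, 2)]

for _ in range(20000):
    for s, s_next in transitions:
        delta = phi[s] + gamma * Psi[s_next] - Psi[s]  # Bellman residual
        # Full gradient of 0.5 * ||delta||^2: differentiate through BOTH
        # the online estimate Psi[s] and the bootstrap target Psi[s_next].
        Psi[s] += lr * delta
        Psi[s_next] -= lr * gamma * delta  # semi-gradient TD omits this term

# On this deterministic chain the residual can be driven to zero, so the
# update recovers the true successor features, e.g. Psi[0] ~ [1, 0.9, 8.1].
```

A semi-gradient update would drop the `Psi[s_next]` correction and treat the target as a constant; the full-gradient form is exact gradient descent on the Mean Squared Bellman Error, which is what underlies the convergence guarantee claimed above.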
📝 Abstract
Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a powerful framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. Combined with non-linear function approximation, these updates lack robust convergence guarantees and can become unstable, particularly in the multi-task setting, where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that learns the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to the parameters of both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual yields superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.
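The SF-GPI decoupling the abstract refers to can be sketched in a few lines: once the successor features of previously learned policies are stored, the Q-values of a new task with reward weights w follow from a dot product, and GPI acts greedily over the best of them. The tabular layout and names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def gpi_action(psis, w, s):
    """Generalized Policy Improvement over a library of successor features.

    psis : list of arrays of shape (n_states, n_actions, d), the stored
           SF of each previously learned policy (illustrative tabular form).
    w    : reward weights of the NEW task, so Q_i(s, a) = psi_i[s, a] @ w.
    """
    q = np.stack([psi[s] @ w for psi in psis])  # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())          # greedy over the best policy
```

Transfer happens with no further learning: only the reward weights w change across tasks, while the stored features capture the shared dynamics. Accurate SF estimates are therefore essential, which is the motivation the abstract gives for replacing semi-gradient TD with the full-gradient objective.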