Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

📅 2023-01-26
🏛️ International Conference on Machine Learning
📈 Citations: 3
Influential: 0
🤖 AI Summary
In off-policy reinforcement learning, multi-step evaluation suffers from a fundamental bias-variance trade-off: conventional importance sampling truncation is irreversible, while trajectory-aware methods lack theoretical guarantees. This paper proposes a unified multi-step Bellman operator framework that expresses both decision-level and trajectory-level credit assignment. The authors establish the first rigorous tabular convergence guarantees for a broad class of trajectory-aware estimators. They further introduce Recency-Bounded Importance Sampling (RBIS), a novel weighting scheme that balances bias reduction against variance. Empirical results demonstrate that the approach improves stability and sample efficiency in off-policy control, with RBIS performing robustly across diverse λ settings.
📝 Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.
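The per-decision mechanism the abstract describes can be illustrated with a minimal tabular sketch. This is not the paper's proposed operator or RBIS; it is a hedged example of the classical baseline, assuming a Retrace-style `min(1, ρ)` truncation as the cutting protocol and hypothetical names throughout.

```python
import numpy as np

def offpolicy_td_lambda(transitions, pi, mu, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular off-policy TD(lambda) with per-decision importance sampling.

    Past TD errors receive credit through eligibility traces that are
    re-weighted by the truncated IS ratio min(1, pi/mu) after every action.
    Once a ratio of 0 multiplies the traces, they are cut irreversibly --
    the limitation that motivates trajectory-aware methods.
    """
    e = np.zeros_like(V)  # eligibility traces, one per state
    for (s, a, r, s_next) in transitions:
        rho = min(1.0, pi[s, a] / mu[s, a])    # truncated per-decision IS ratio
        delta = r + gamma * V[s_next] - V[s]   # TD error for this step
        e *= gamma * lam * rho                 # decay and (possibly) cut traces
        e[s] += 1.0                            # accumulate trace at current state
        V += alpha * delta * e                 # credit delta to all traced states
    return V
```

With `pi == mu` (on-policy), every ratio is 1 and this reduces to ordinary accumulating-trace TD(λ); the cutting behavior only appears when the target policy assigns less probability to an action than the behavior policy did.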
Problem

Research questions and friction points this paper is trying to address.

Addresses off-policy bias correction in reinforcement learning
Provides theoretical analysis of trajectory-aware credit-assignment methods
Introduces RBIS for robust off-policy control across λ-values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multistep operator unifying per-decision and trajectory-aware methods
Convergence guarantees for trajectory-aware off-policy algorithms
Recency-Bounded Importance Sampling for robust off-policy control