Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multi-step extensions of soft Q-learning have so far been limited to on-policy action sampling under the Boltzmann policy, which restricts credit assignment when data come from an arbitrary behavior policy. This work proposes Soft $Q(\lambda)$. It first gives a formal $n$-step formulation of soft Q-learning, then introduces a Soft Tree Backup operator that extends the formulation to the fully off-policy case, and finally combines the two into an online, off-policy eligibility-trace framework for entropy-regularized reinforcement learning. The result is a model-free method for learning entropy-regularized value functions under any behavior policy. The contribution is a derivation; the authors leave empirical evaluation to future work.

📝 Abstract

Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(\lambda)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations yield a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.
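
For orientation, the one-step baseline that the abstract builds on can be sketched in tabular form. This is a minimal illustration of standard soft Q-learning, not the paper's $n$-step, Soft Tree Backup, or Soft $Q(\lambda)$ updates, which are not reproduced on this page. The notation is an assumption made for the example: `beta` is an inverse temperature weighting the divergence penalty, `prior` is the reference policy, and all function names are hypothetical.

```python
import numpy as np

def soft_value(q_row, prior_row, beta):
    """Soft state value: V(s) = (1/beta) * log sum_a prior(a|s) * exp(beta * Q(s, a)).

    Computed with the log-sum-exp shift for numerical stability.
    """
    z = beta * q_row
    m = z.max()
    return (m + np.log(np.sum(prior_row * np.exp(z - m)))) / beta

def boltzmann_policy(q_row, prior_row, beta):
    """Boltzmann policy: pi(a|s) proportional to prior(a|s) * exp(beta * Q(s, a))."""
    z = beta * q_row
    w = prior_row * np.exp(z - z.max())
    return w / w.sum()

def soft_q_update(Q, prior, s, a, r, s_next, done,
                  alpha=0.1, gamma=0.99, beta=1.0):
    """One-step tabular soft Q-learning update, applied to Q in place.

    The bootstrap target uses the soft value of the next state, so it does not
    depend on which action the behaviour policy takes next.
    """
    bootstrap = 0.0 if done else soft_value(Q[s_next], prior[s_next], beta)
    td_error = r + gamma * bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Usage: two states, two actions, uniform reference policy (all assumed values).
Q = np.zeros((2, 2))
prior = np.full((2, 2), 0.5)
delta = soft_q_update(Q, prior, s=0, a=1, r=1.0, s_next=1, done=False)
pi = boltzmann_policy(Q[0], prior[0], beta=1.0)
```

Because the one-step target involves only the soft value of the next state, this baseline is already off-policy. The difficulty the abstract points to arises once targets span several steps, where the actions actually taken by the behaviour policy enter the return.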

Problem

Research questions and friction points this paper is trying to address.

soft Q-learning

multi-step

off-policy

entropy regularisation

eligibility traces

Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Q-learning

off-policy

eligibility traces