Convergence of regularized agent-state-based Q-learning in POMDPs

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the convergence of agent-state-based Q-learning—where the Q-table is updated using an agent state (e.g., an RNN hidden state) rather than a belief or information state—in partially observable Markov decision processes (POMDPs). The authors study the simplest form of such algorithms, which they call regularized agent-state-based Q-learning (RASQL): a recursive Q-table update combined with policy regularization, requiring no explicit belief inference. Under mild technical conditions—including ergodicity assumptions on the process induced by the behavior policy—RASQL is shown to converge to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavior policy. A similar analysis applies to a variant of RASQL that learns periodic policies. Numerical experiments illustrate that the empirical convergence behavior matches the predicted theoretical limit.

📝 Abstract
In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches the proposed theoretical limit.
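The update the abstract describes—a recursive Q-table update indexed by an agent state rather than a belief state, with regularization entering through a soft (entropy-regularized) target and a softmax behavior policy—can be sketched on a toy problem. This is a minimal illustrative sketch, not the paper's algorithm or experiments: the POMDP transition/observation/reward numbers, the choice of agent state (the most recent observation), the soft-Bellman form of the regularizer, and the constant step size are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, gamma, alpha = 0.5, 0.9, 0.1  # temperature, discount, step size

# Toy POMDP (hypothetical numbers, illustration only):
# 2 hidden states, 2 actions, 2 observations.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.85, 0.15]]])  # P[s, a] = next-state distribution
O = np.array([[0.85, 0.15], [0.1, 0.9]])    # O[s]    = observation distribution
R = np.array([[1.0, 0.0], [0.0, 1.0]])      # R[s, a] = reward

# Agent state z = last observation (not a belief state), so Q is a 2x2 table.
Q = np.zeros((2, 2))
s = 0
z = rng.choice(2, p=O[s])
for t in range(20000):
    # Regularized (softmax) behavior policy over Q(z, .).
    logits = Q[z] / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)

    r = R[s, a]
    s = rng.choice(2, p=P[s, a])
    z_next = rng.choice(2, p=O[s])

    # Soft (entropy-regularized) Bellman target at the next agent state.
    soft_v = tau * np.log(np.sum(np.exp(Q[z_next] / tau)))
    Q[z, a] += alpha * (r + gamma * soft_v - Q[z, a])
    z = z_next

print(np.round(Q, 2))
```

Because the agent state is not a sufficient statistic of the history, the table this iteration settles near depends on the stationary distribution induced by the behavior policy, which is exactly the dependence the paper's limit characterizes.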
Problem

Research questions and friction points this paper is trying to address.

Convergence of Q-learning with agent state in POMDPs
Analyzing regularized Q-learning without belief states
Establishing convergence to regularized MDP fixed point
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-state-based Q-learning with regularization
Convergence to regularized MDP fixed point
Periodic policy variant with same analysis