Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the finite-time convergence of single-timescale actor-critic algorithms under dynamically evolving reward functions, such as reward shaping, entropy regularization, and curriculum learning, in Markovian sampling settings. To handle the bias that per-step changes in the reward parameters induce in both policy optimization and value estimation, the authors develop a novel non-asymptotic stochastic approximation framework. They establish, for the first time, an $O(1/\sqrt{T})$ convergence rate under the condition that the reward parameters evolve slowly, matching the best-known rate for static rewards. Key technical contributions include: (i) a refined analysis of distribution mismatch under Markovian sampling, which improves the best-known static-reward rate by a $\log^2 T$ factor; (ii) joint non-asymptotic error bounds for the actor and critic updates; and (iii) the first rigorous theoretical foundation for evolving-reward mechanisms in actor-critic learning.
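
Concretely, the object of analysis is a single loop in which critic, actor, and reward parameter all move at the same step-size scale. The toy sketch below (the random tabular MDP, entropy-bonus reward, decay schedule, and constants are illustrative assumptions, not the paper's setup) shows that update pattern under Markovian sampling from one trajectory.

```python
# Minimal sketch (not the paper's code): single-timescale actor-critic
# with an evolving reward on a toy tabular MDP, Markovian sampling.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, T = 5, 3, 0.95, 50_000
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(0, 1, size=(S, A))           # base (static) reward table
theta = np.zeros((S, A))                     # softmax policy parameters (actor)
V = np.zeros(S)                              # tabular value estimates (critic)
eta = 0.2                                    # evolving reward parameter (entropy weight)
alpha = 0.5 / np.sqrt(T)                     # single timescale: one Theta(1/sqrt(T)) step size

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(T):
    p = policy(s)
    a = rng.choice(A, p=p)
    s_next = rng.choice(S, p=P[s, a])
    # Evolving reward: base reward plus an entropy bonus weighted by eta_t.
    r = R[s, a] - eta * np.log(p[a])
    # Critic: TD(0) step toward the value function of the current reward.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    # Actor: policy-gradient step with the TD error as advantage estimate;
    # grad of log pi(a|s) w.r.t. theta[s] is e_a - p.
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha * delta * grad_log
    # Reward parameter drifts on the same timescale: the decay is
    # proportional to alpha, so |eta_{t+1} - eta_t| = O(alpha_t),
    # the slow-evolution regime the analysis requires.
    eta = max(0.0, eta - 0.01 * alpha)
    s = s_next
```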

📝 Abstract
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions, through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2 T$ in the static-reward case.
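
To make the shape of this guarantee concrete, the display below is a hedged sketch, not the paper's exact statement: the notation $J_\eta$, $\theta_t$, $\eta_t$, and $\alpha_t$ is assumed here. It expresses a stationarity measure decaying at the static-reward rate whenever the reward parameters drift no faster than the step size.

```latex
% Hedged sketch of the guarantee's shape (assumed notation): J_eta is the
% objective under reward parameter eta, theta_t the actor iterate, and
% alpha_t = Theta(1/sqrt(T)) the shared step size.
\min_{1 \le t \le T} \mathbb{E}\left[\big\|\nabla_\theta J_{\eta_t}(\theta_t)\big\|^2\right]
  = O\!\left(\frac{1}{\sqrt{T}}\right)
\quad \text{provided} \quad \|\eta_{t+1} - \eta_t\| = O(\alpha_t).
```
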
Problem

Research questions and friction points this paper is trying to address.

Analyzes actor-critic convergence with evolving reward functions
Establishes finite-time error bounds under Markovian sampling conditions
Demonstrates O(1/√T) rate preservation for slowly changing rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-timescale actor-critic with evolving rewards
Finite-time convergence analysis under Markovian sampling
Gradient-based reward updates matching the actor-critic timescale (see the sketch below)
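
The third item above, gradient-based reward updates on the shared timescale, is the regime in which the abstract states the rate is preserved. The following is a hedged sketch of such an update; the clipping constant and the gradient estimate are hypothetical, introduced only to make the bounded-gradient condition explicit.

```python
import numpy as np

def update_reward_param(eta, grad_estimate, alpha, clip=1.0):
    """One evolving-reward step: clip the (hypothetical) shaping-objective
    gradient so it stays bounded, then move eta with the same step size
    alpha = Theta(1/sqrt(T)) used by the actor and critic. Boundedness plus
    the shared timescale give |eta_{t+1} - eta_t| <= clip * alpha = O(alpha_t),
    the slow-evolution condition under which the rate is preserved."""
    g = np.clip(grad_estimate, -clip, clip)
    return eta - alpha * g

# Example: one step with an assumed gradient estimate and T = 10_000.
eta = update_reward_param(eta=0.2, grad_estimate=0.3, alpha=1 / np.sqrt(10_000))
```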