Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing convergence analyses for robust temporal-difference (TD) learning are restricted to tabular MDPs or, under linear function approximation, rely on strong discounting assumptions. Method: We propose the first model-free robust TD algorithm with linear function approximation, defining uncertainty sets via total-variation and Wasserstein-1 distances to enable worst-case policy evaluation under distributional ambiguity. Our approach integrates two-timescale stochastic approximation with outer-loop target-network updates, eliminating the need for generative-model access. Contribution/Results: We establish the first non-asymptotic, finite-time convergence guarantee for distributionally robust TD learning: the algorithm achieves ε-accurate value estimation with $\tilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity. This result bridges a fundamental theoretical gap in robust reinforcement learning under function approximation, providing the first rigorous convergence analysis for robust TD methods beyond the tabular setting.
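For orientation, the robust policy-evaluation target behind these results is the fixed point of a robust Bellman operator; the total-variation case admits a scalar dual that makes the worst case computable from the nominal model alone. The form below is the standard TV dual from the DRRL literature, reproduced here as context rather than as the paper's exact formulation.

```latex
% Robust Bellman evaluation operator with uncertainty radius \delta:
%   (T^\pi_\delta V)(s) = \sum_a \pi(a \mid s)\Big[ r(s,a)
%       + \gamma \inf_{p \in \mathcal{U}_\delta(p^0_{s,a})} \mathbb{E}_{s' \sim p}[V(s')] \Big]
% For a total-variation ball, the inner infimum has a one-dimensional dual:
\inf_{\mathrm{TV}(p,\, p^0_{s,a}) \le \delta} \mathbb{E}_{s' \sim p}\big[V(s')\big]
  = \max_{\alpha \in [\min_{s'} V,\; \max_{s'} V]}
    \Big\{ \mathbb{E}_{s' \sim p^0_{s,a}}\big[\min(V(s'), \alpha)\big]
           - \delta \big(\alpha - \min_{s'} V(s')\big) \Big\}
```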

📝 Abstract
Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. In particular, we are interested in maximizing the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for robust temporal-difference (TD) learning for policy evaluation are limited to tabular MDPs or depend on restrictive discount-factor assumptions when function approximation is used. We present the first robust TD learning algorithm with linear function approximation, where robustness is measured with respect to total-variation and Wasserstein-1 distance uncertainty sets. Additionally, our algorithm is model-free and does not require generative access to the MDP. Our algorithm combines a two-time-scale stochastic-approximation update with an outer-loop target-network update. We establish an $\tilde{O}(1/\varepsilon^2)$ sample complexity to obtain an $\varepsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust Q-learning with function approximation.
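To make the total-variation case concrete, here is a minimal sketch (my illustration, not the paper's pseudocode): the TV worst case is solved by grid search over the scalar dual variable and plugged into a TD(0)-style update with linear features. The function names are hypothetical, and the assumption that the nominal next-state distribution over a finite candidate set is available is a simplification; the paper's algorithm is fully online and instead tracks the dual quantity on a faster timescale.

```python
import numpy as np

def tv_worst_case(v, p0, delta):
    """inf_{TV(p, p0) <= delta} E_p[v] via its scalar dual:
    max_alpha  E_{p0}[min(v, alpha)] - delta * (alpha - min(v)).
    Solved here by a simple grid search over alpha (illustrative only)."""
    alphas = np.linspace(v.min(), v.max(), 200)
    duals = [p0 @ np.minimum(v, a) - delta * (a - v.min()) for a in alphas]
    return max(duals)

def robust_td_update(theta, phi, r, next_phis, p0, gamma, delta, lr):
    """One robust TD(0) step with linear features V_theta(s) = phi(s) @ theta.
    next_phis: feature rows of candidate next states; p0: nominal probabilities."""
    v_next = next_phis @ theta                        # candidate next-state values
    target = r + gamma * tv_worst_case(v_next, p0, delta)
    td_error = target - phi @ theta
    return theta + lr * td_error * phi
```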
Problem

Research questions and friction points this paper is trying to address.

Develops robust TD learning with linear function approximation under model uncertainty
Establishes finite-time convergence guarantees without restrictive discount-factor assumptions
Provides a model-free algorithm with Õ(1/ε²) sample complexity for ε-accurate value estimates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust TD learning with linear function approximation
Model-free algorithm without generative access to the MDP
Two-time-scale stochastic approximation with outer-loop target-network updates (see the sketch below)
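A schematic of that two-time-scale structure, under stated assumptions: all names, step-size choices, and the simplified dual update below are illustrative, and the exact parameterization, projection steps, and constants are given in the paper. A fast iterate tracks the TV-dual variable, a slow iterate updates the value weights, and an outer loop periodically freezes a target-network copy of the weights.

```python
import numpy as np

def robust_td_two_timescale(sample, phi_dim, gamma, delta,
                            n_outer=50, n_inner=2000):
    """Schematic two-time-scale robust TD with a target network.
    `sample()` is assumed to return one transition (phi_s, r, phi_s_next)
    drawn from the nominal environment (no generative model needed)."""
    theta = np.zeros(phi_dim)             # slow iterate: value weights
    for k in range(n_outer):
        theta_bar = theta.copy()          # outer loop: freeze target network
        alpha_dual = 0.0                  # fast iterate: dual-variable estimate
        for t in range(1, n_inner + 1):
            fast_lr, slow_lr = 1.0 / t**0.6, 1.0 / t   # fast decays slower
            phi_s, r, phi_s_next = sample()
            v_next = phi_s_next @ theta_bar            # bootstrap from target net
            # Fast timescale: stochastic ascent on the TV dual objective
            #   max_alpha E[min(v, alpha)] - delta * alpha
            grad = (1.0 if v_next > alpha_dual else 0.0) - delta
            alpha_dual += fast_lr * grad
            # Per-sample estimate of the dual objective E[min(v, alpha)] - delta*alpha
            # (the +delta * min_s' V term is omitted here for brevity)
            robust_v = min(v_next, alpha_dual) - delta * alpha_dual
            # Slow timescale: TD step toward the robust target
            td_error = r + gamma * robust_v - phi_s @ theta
            theta += slow_lr * td_error * phi_s
    return theta
```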