Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

📅 2025-11-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses sample-efficient self-play in offline robust two-player zero-sum Markov games (RTZMGs), tackling the sim-to-real gap arising from environmental uncertainty and distributional shift in historical data. We propose RTZ-VI-LCB, the first algorithm achieving optimal sample complexity in both state and action spaces. Built on a model-based framework, it integrates optimistic robust value iteration with a data-driven Bernstein-type confidence penalty to enable robust value function estimation. Theoretically, we establish a near-optimal sample complexity bound and prove its tightness via an information-theoretic lower bound. Empirically, RTZ-VI-LCB significantly improves policy robustness and generalization over baselines, setting a new benchmark for offline robust game learning.
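
The penalty the summary refers to is, in spirit, a variance-aware Bernstein bonus subtracted from the empirical Bellman backup to make value estimates pessimistic on poorly covered state-action pairs. Below is a minimal Python sketch of such a penalty for a single (state, action, action) tuple; the function name, the constants `c1`/`c2`, and the exact log factor are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def bernstein_lcb_penalty(p_hat, v, n, delta, c1=1.0, c2=1.0):
    """Bernstein-style lower-confidence-bound penalty for one (s, a, b) tuple.

    p_hat : empirical next-state distribution estimated from n samples
    v     : current value estimates over next states (bounded array)
    n     : visit count of this tuple in the offline dataset
    delta : failure probability for the confidence bound
    c1,c2 : absolute constants (the paper's exact constants are not reproduced)
    """
    if n == 0:
        # Unvisited tuples get an infinite penalty, i.e. maximal pessimism.
        return np.inf
    log_term = np.log(1.0 / delta)
    # Empirical variance of V under the estimated transition kernel.
    var = p_hat @ (v ** 2) - (p_hat @ v) ** 2
    # Variance-aware sqrt term plus a lower-order 1/n correction.
    return c1 * np.sqrt(var * log_term / n) + c2 * (v.max() - v.min()) * log_term / n

# Pessimistic (LCB) one-step estimate for the max-player would then look like:
# q_lcb = r + p_hat @ v - bernstein_lcb_penalty(p_hat, v, n, delta)
```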

📝 Abstract
Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which combines optimistic robust value iteration with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound confirms the tightness of our algorithm's sample complexity, which is optimal with respect to both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first algorithm to attain this optimality; it sets a new benchmark for offline RTZMGs and is validated experimentally.
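
Value iteration in a two-player zero-sum Markov game differs from the single-agent case in one step: at each state, the backup solves a matrix game over the two players' joint actions instead of taking a max. A minimal sketch of that stage-game solve via linear programming follows, assuming NumPy and SciPy; `zero_sum_game_value` is a hypothetical helper for illustration, not code from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_game_value(Q):
    """Value and max-player strategy of the matrix game max_x min_y x^T Q y.

    Q : (m, k) payoff matrix for the max-player, e.g. the robust Q-estimates
        Q[a, b] at one state. Solved as a standard LP.
    """
    m, k = Q.shape
    # Variables z = (x_1..x_m, v); minimize -v subject to (Q^T x)_b >= v for all b.
    c = np.concatenate([np.zeros(m), [-1.0]])
    A_ub = np.hstack([-Q.T, np.ones((k, 1))])             # v - (Q^T x)_b <= 0
    b_ub = np.zeros(k)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # x lies on the simplex
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[m], res.x[:m]

# Example: matching pennies has value 0 and a uniform max-player strategy.
# v, x = zero_sum_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```
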
Problem

Research questions and friction points this paper is trying to address.

- Develops a sample-efficient algorithm for offline robust two-player zero-sum Markov games
- Addresses distribution shifts and environmental uncertainties in historical datasets (a sketch of the empirical model estimate follows this list)
- Establishes optimal sample complexity under partial coverage conditions
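
Being model-based, this style of approach first turns the historical dataset into visit counts and an empirical transition kernel; tuples the behavior policy never covered are exactly where partial coverage bites and where the LCB penalty stays maximal. A minimal sketch, assuming an illustrative `(s, a, b, r, s_next)` tuple format for the offline data:

```python
import numpy as np
from collections import defaultdict

def build_empirical_model(dataset, n_states):
    """Visit counts and empirical transition kernels from offline data.

    dataset : iterable of (s, a, b, r, s_next) tuples collected by some
              unknown behavior policy; the format is an assumption here.
    Returns counts[(s, a, b)] and normalized kernels p_hat[(s, a, b)].
    """
    counts = defaultdict(int)
    transitions = defaultdict(lambda: np.zeros(n_states))
    for s, a, b, r, s_next in dataset:
        counts[(s, a, b)] += 1
        transitions[(s, a, b)][s_next] += 1
    # Tuples absent from the dataset simply never appear in these dicts.
    p_hat = {key: vec / counts[key] for key, vec in transitions.items()}
    return counts, p_hat
```
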
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Optimistic robust value iteration with a Bernstein-style penalty (a sketch of the robust backup's inner minimization follows this list)
- Addresses distribution shifts under partial coverage
- Achieves optimal sample complexity in offline settings
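
The "robust" half of the backup replaces the empirical expectation of the next-state value with its worst case over an uncertainty set of transition kernels. The sketch below solves that inner minimization exactly for a total-variation ball via a small LP; the choice of uncertainty set and the radius `sigma` are assumptions for illustration, since the paper's exact uncertainty model is not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

def robust_expectation_tv(p_hat, v, sigma):
    """Worst-case expectation inf_{p in U} p @ v over the total-variation ball
    U = { p : p >= 0, sum(p) = 1, ||p - p_hat||_1 <= 2*sigma }.
    """
    S = len(v)
    # Variables z = (p, u) with u >= |p - p_hat| componentwise.
    c = np.concatenate([v, np.zeros(S)])
    I = np.eye(S)
    A_ub = np.vstack([
        np.hstack([ I, -I]),                                   #  p - u <= p_hat
        np.hstack([-I, -I]),                                   # -p - u <= -p_hat
        np.concatenate([np.zeros(S), np.ones(S)])[None, :],    # sum(u) <= 2*sigma
    ])
    b_ub = np.concatenate([p_hat, -p_hat, [2.0 * sigma]])
    A_eq = np.concatenate([np.ones(S), np.zeros(S)])[None, :]  # p on the simplex
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * S)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun  # worst-case value of p @ v

# With sigma = 0 this recovers the plain empirical expectation p_hat @ v.
```
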
Authors

Na Li
Zhejiang University

Zewu Zheng
The Chinese University of Hong Kong

Wei Ni
FIEEE, AAIA Fellow, Senior Principal Scientist & Conjoint Professor, CSIRO/UNSW
6G security and privacy; connected and trusted intelligence; applied AI/ML

Hangguan Shan
Zhejiang University
Wireless communications and wireless networking

Wenjie Zhang
University of New South Wales

Xinyu Li
Huazhong University of Science and Technology