AI Summary
This work addresses sample-efficient self-play in offline robust two-player zero-sum Markov games (RTZMGs), tackling the sim-to-real gap arising from environmental uncertainty and distributional shift in historical data. We propose RTZ-VI-LCB, the first algorithm achieving optimal sample complexity in both the state and action spaces. Built on a model-based framework, it integrates optimistic robust value iteration with a data-driven Bernstein-type confidence penalty to enable robust value function estimation. Theoretically, we establish a near-optimal sample complexity bound and prove its tightness via a matching information-theoretic lower bound. Empirically, RTZ-VI-LCB significantly improves policy robustness and generalization over baselines, setting a new benchmark for offline robust game learning.
Abstract
Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (\textit{RTZ-VI-LCB}) for offline RTZMGs, which combines optimistic robust value iteration with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal with respect to both the state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first algorithm to attain this optimality; it sets a new benchmark for offline RTZMGs and is validated experimentally.
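To make the mechanism concrete, the sketch below shows the general idea of a data-driven Bernstein-style penalty inside pessimistic (lower-confidence-bound) value iteration on a tabular model. This is a minimal single-agent illustration under assumed simplifications: the function names (`bernstein_penalty`, `vi_lcb`), the toy MDP, and the constants are all hypothetical, and the paper's actual RTZ-VI-LCB additionally handles two competing players and robust (worst-case) transition uncertainty sets, which are omitted here.

```python
import numpy as np

def bernstein_penalty(values, p_hat, n, c=1.0, delta=0.05):
    """Bernstein-type penalty: scales with the empirical variance of the
    next-state value under the estimated transition row p_hat, shrinking
    as the visit count n grows. Constants c, delta are illustrative."""
    var = p_hat @ (values ** 2) - (p_hat @ values) ** 2
    log_term = np.log(1.0 / delta)
    return c * (np.sqrt(var * log_term / n) + log_term / n)

def vi_lcb(p_hat, r, counts, gamma=0.9, iters=200):
    """Pessimistic value iteration on an estimated tabular MDP: each
    Q-value is penalized before maximization, yielding a lower confidence
    bound on the value despite only partial dataset coverage."""
    S, A, _ = p_hat.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                b = bernstein_penalty(v, p_hat[s, a], counts[s, a])
                q[s, a] = r[s, a] + gamma * (p_hat[s, a] @ v) - b
        v = np.clip(q.max(axis=1), 0.0, None)  # keep values nonnegative
    return v
```

The variance-dependent penalty is what distinguishes a Bernstein-style bound from a cruder Hoeffding-style one: state-action pairs whose next-state values have low variance are penalized less, which is one ingredient behind tighter sample complexity guarantees.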