Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of ensuring safety in constrained reinforcement learning when external adversarial perturbations, such as those from competitors or environmental disturbances, affect state transitions. The authors describe it as the first study of safety-constrained reinforcement learning under explicitly adversarial dynamics, and introduce a model-based algorithm, Robust Hallucinated Constrained Upper-Confidence Reinforcement Learning (RHC-UCRL). RHC-UCRL models external perturbations as an adversarial policy, explicitly separates epistemic from aleatoric uncertainty, and maintains optimism over both the agent's and the adversary's policies during learning. Theoretical analysis establishes that RHC-UCRL achieves sublinear bounds on both cumulative regret and cumulative constraint violation, giving provable guarantees for policy learning in adversarial environments.
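The split between epistemic and aleatoric uncertainty mentioned above can be illustrated with a small sketch. This is not the paper's implementation: it assumes a hypothetical ensemble of learned dynamics models, each predicting a mean next state and a noise variance, and uses the standard decomposition in which disagreement between members is read as epistemic uncertainty and the average predicted noise as aleatoric uncertainty. The function names, the confidence scaling `beta`, and the selector `eta` in [-1, 1] are all illustrative assumptions, loosely in the spirit of hallucinated-control methods.

```python
import statistics


def decompose_uncertainty(member_means, member_noise_vars):
    """Split predictive uncertainty for a scalar next state.

    member_means: predicted mean next state from each (hypothetical) ensemble member.
    member_noise_vars: each member's predicted noise variance.
    """
    mean = statistics.fmean(member_means)
    # Epistemic: disagreement between models; shrinks as more data is collected.
    epistemic_var = statistics.pvariance(member_means)
    # Aleatoric: inherent transition noise; does not shrink with more data.
    aleatoric_var = statistics.fmean(member_noise_vars)
    return mean, epistemic_var, aleatoric_var


def hallucinated_next_state(mean, epistemic_var, beta, eta):
    """Pick a plausible next state inside the epistemic confidence interval.

    eta in [-1, 1] selects a point in [mean - beta*std, mean + beta*std].
    Only epistemic uncertainty is used, so noise is never mistaken for
    something exploration can reduce.
    """
    if not -1.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [-1, 1]")
    return mean + beta * epistemic_var ** 0.5 * eta


if __name__ == "__main__":
    mean, epi, ale = decompose_uncertainty([1.5, 2.5], [0.1, 0.1])
    # Two ends of the confidence interval; which end counts as "optimistic"
    # depends on whose objective (agent or adversary) is being evaluated.
    upper = hallucinated_next_state(mean, epi, beta=2.0, eta=1.0)
    lower = hallucinated_next_state(mean, epi, beta=2.0, eta=-1.0)
    print(mean, epi, ale, upper, lower)
```

In this toy setting, an optimistic learner would choose `eta` to favor the party being planned for, which is one way to read "optimism over both agent and adversary policies"; the exact construction used by RHC-UCRL is not given in this summary.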

📝 Abstract

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions but also on \textbf{exogenous factors outside its control}, such as competing agents, environmental disturbances, or strategic adversaries. Formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$, where $a_h$ is the agent's action, $\bar{a}_h$ is the adversary's (external) action, and $\omega_h$ is additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically in deployment}, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this through distributional robustness over transition kernels, but they do not explicitly model the \textbf{strategic interaction} between the agent and the exogenous factor, and they rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an \textbf{adversarial policy} $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. \emph{To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics}. We propose \textbf{Robust Hallucinated Constrained Upper-Confidence RL} (\texttt{RHC-UCRL}), a model-based algorithm that maintains optimism over both agent and adversary policies while explicitly separating epistemic from aleatoric uncertainty. \texttt{RHC-UCRL} achieves sublinear regret and constraint violation guarantees.
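To make the transition model in the abstract concrete, the following toy rollout simulates one-dimensional dynamics in which the next state depends on the agent's action, an adversary's action, and additive Gaussian noise. Everything specific here is an assumption chosen for illustration and does not come from the paper: the linear form of `f`, the two hand-written policies, the initial state, and the safety limit.

```python
import random


def step(state, agent_action, adversary_action, noise_std, rng):
    """One transition: next state = f(state, agent, adversary) + noise."""
    # Hypothetical f: linear dynamics where both actions enter additively.
    mean_next = 0.9 * state + agent_action + adversary_action
    return mean_next + rng.gauss(0.0, noise_std)


def agent_policy(state):
    # Hypothetical agent: steer toward the origin, action clipped to [-1, 1].
    return max(-1.0, min(1.0, -0.5 * state))


def adversary_policy(state):
    # Hypothetical adversary: bounded push away from the origin.
    return 0.3 if state >= 0 else -0.3


def rollout(horizon, use_adversary, seed=0, noise_std=0.05, safe_limit=1.0):
    """Simulate one episode; return visited states and the number of unsafe steps."""
    rng = random.Random(seed)
    state = 0.5
    states = [state]
    violations = 0
    for _ in range(horizon):
        a = agent_policy(state)
        a_bar = adversary_policy(state) if use_adversary else 0.0
        state = step(state, a, a_bar, noise_std, rng)
        if abs(state) > safe_limit:
            violations += 1
        states.append(state)
    return states, violations


if __name__ == "__main__":
    for use_adversary in (False, True):
        states, violations = rollout(horizon=20, use_adversary=use_adversary)
        print(use_adversary, round(states[-1], 3), violations)
```

With the noise switched off, the same agent policy drives the state toward zero when it acts alone, but the state stays pinned at 0.5 once the adversary pushes back. This is a small instance of a policy that looks fine in isolation and behaves differently when an exogenous actor co-determines the transitions.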

Problem

Research questions and friction points this paper is trying to address.

adversarial dynamics

constrained reinforcement learning

safety guarantees

exogenous factors

strategic interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial dynamics

constrained reinforcement learning

optimism under uncertainty