🤖 AI Summary
This work addresses online reinforcement learning in partially adversarial Markov decision processes (MDPs), where up to $\Lambda$ steps per episode may be subject to adversarial corruptions that are either arbitrarily located or contiguous. To overcome the failure of conventional occupancy measures under adversarial transitions, the authors introduce a novel *conditioned occupancy measure* and build a unified optimization framework around it. Two algorithms are developed, one for arbitrarily located and one for contiguous adversarial steps: for arbitrary corruptions the regret is $\widetilde{O}(H S^\Lambda \sqrt{K S A^{\Lambda+1}})$, while for contiguous corruptions it improves to $\widetilde{O}(H \sqrt{K S^3 A^{\Lambda+1}})$. A reduction further removes the need to know the locations of the adversarial steps, at the cost of a $K^{2/3}$ regret rate. The analysis also characterizes the fully adversarial setting, establishing nearly matching upper and lower bounds and clarifying how full-information versus bandit feedback affects the difficulty of learning.
📝 Abstract
We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $\Lambda$ adversarial ones. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda=H-1$) for both full-information and bandit feedback, providing almost matching upper and lower bounds (slightly strengthening existing lower bounds, and clarifying how different feedback structures affect the hardness of learning).
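For context, the regret notion these bounds refer to is not spelled out in the abstract; a standard episodic formulation (an assumption here, since the paper's benchmark may differ in detail under adversarial transitions) compares the learner's policies $\pi_1,\dots,\pi_K$ against the best fixed policy in hindsight:

$$
\mathrm{Regret}(K) \;=\; \max_{\pi}\,\sum_{k=1}^{K} \Big( V_1^{k,\pi}(s_1) - V_1^{k,\pi_k}(s_1) \Big),
$$

where $V_1^{k,\pi}(s_1)$ denotes the value of policy $\pi$ in episode $k$ starting from the initial state $s_1$. Under such a definition, the $\tilde{O}(\sqrt{K})$-type bounds above correspond to the algorithms that know the adversarial steps' locations, while the reduction that dispenses with this knowledge pays the slower $K^{2/3}$ rate.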