Provable Partially Observable Reinforcement Learning with Privileged Information

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies partially observable reinforcement learning (RL) with privileged information, such as access to the underlying states from a simulator during training, in both single-agent and multi-agent settings. It examines two practically used paradigms: expert distillation (teacher-student learning) and asymmetric actor-critic. After exhibiting a pitfall of naive expert distillation, the paper identifies a deterministic filter condition under which distillation attains polynomial sample and computational complexities. For observable POMDPs, it develops a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexity, whose key component is a new provable oracle for learning belief states that preserves filter stability under a misspecified model. Finally, it designs multi-agent RL algorithms compatible with centralized training with decentralized execution (CTDE) that enjoy polynomial sample and (quasi-)polynomial computational complexities, all without relying on computationally intractable oracles, thereby improving practicality and scalability.
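
To make the expert-distillation (teacher-student) paradigm concrete, here is a minimal, purely illustrative sketch in a toy tabular POMDP, assuming a teacher with privileged access to the latent state and a student fit by behavior cloning on observations; the environment and all names are hypothetical, not the paper's construction.

```python
import numpy as np

# Toy expert distillation: a teacher acts on the privileged latent state s;
# a student policy over observations o is distilled via behavior cloning.
rng = np.random.default_rng(0)
n_states, n_obs, n_actions = 4, 3, 2

teacher_policy = rng.integers(n_actions, size=n_states)  # privileged teacher
O = rng.dirichlet(np.ones(n_obs), size=n_states)         # emission P(o | s)

# counts[o, a]: how often the teacher chose a when o was observed.
counts = np.zeros((n_obs, n_actions))
for _ in range(10_000):
    s = rng.integers(n_states)          # latent state (seen only in training)
    o = rng.choice(n_obs, p=O[s])       # the student's observation
    counts[o, teacher_policy[s]] += 1

# Student: most frequent teacher action per observation.
student_policy = counts.argmax(axis=1)
print("distilled student policy per observation:", student_policy)
```

The same toy exposes the pitfall the paper formalizes: when latent states that require different actions emit the same observation, the distilled student averages the teacher's labels and can be far from optimal. Roughly, the deterministic filter condition rules this out by ensuring the belief over latent states is pinned down by the interaction history.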

📝 Abstract
Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another useful empirical paradigm of asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes (POMDPs). We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.
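
As a companion to the abstract's second paradigm, below is a minimal sketch of an asymmetric actor-critic update: the critic is fit on the privileged latent state available during training, while the actor conditions only on the observation it will see at execution time. The belief filter that the abstract's oracle learns approximately is, in its exact form, b'(s') ∝ O(o' | s') Σ_s T(s' | s, a) b(s); this sketch omits the belief-weighting and the filter-stability oracle, and the toy environment is an assumption for illustration only.

```python
import numpy as np

# Toy asymmetric actor-critic: critic V over latent states (privileged),
# actor pi(a | o) over observations (available at execution).
rng = np.random.default_rng(1)
n_states, n_obs, n_actions = 4, 3, 2

O = rng.dirichlet(np.ones(n_obs), size=n_states)                   # P(o | s)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # r(s, a)

logits = np.zeros((n_obs, n_actions))  # actor parameters for softmax policy
V = np.zeros(n_states)                 # asymmetric critic on latent states
alpha, beta, gamma = 0.1, 0.1, 0.95

s = rng.integers(n_states)
for _ in range(20_000):
    o = rng.choice(n_obs, p=O[s])
    p = np.exp(logits[o] - logits[o].max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    s_next = rng.choice(n_states, p=T[s, a])
    td = R[s, a] + gamma * V[s_next] - V[s]  # TD error uses privileged state
    V[s] += beta * td                        # critic update
    grad = -p
    grad[a] += 1.0                           # gradient of log pi(a | o)
    logits[o] += alpha * td * grad           # actor update from observations
    s = s_next

print("greedy actions per observation:", logits.argmax(axis=1))
```

The design point this illustrates is the paradigm's practical appeal: fitting the critic on the latent state sidesteps value estimation from long histories, and the paper's contribution is to show when such asymmetry can be made provably efficient.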
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Partial Observability
Privileged Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deterministic Filter Condition
Asymmetric Actor-Critic Algorithm
Privileged Information Multi-Agent Reinforcement Learning