Provable Partially Observable Reinforcement Learning with Privileged Information

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies partially observable reinforcement learning (RL) with privileged information, such as access to the underlying states from a simulator during training, in both single-agent and multi-agent settings. It examines two practically used paradigms: expert distillation (teacher-student learning) and asymmetric actor-critic. After exhibiting a pitfall of naive expert distillation, the paper identifies a deterministic filter condition under which distillation attains polynomial sample and computational complexities. For observable POMDPs, it develops a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexity, whose key component is a new provable oracle for learning belief states that preserves filter stability under a misspecified model. Finally, it designs multi-agent RL algorithms compatible with centralized training with decentralized execution (CTDE) that enjoy polynomial sample and (quasi-)polynomial computational complexities, all without relying on computationally intractable oracles, thereby improving practicality and scalability.
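
To make the expert-distillation (teacher-student) paradigm concrete, here is a minimal, purely illustrative sketch in a toy tabular POMDP, assuming a teacher with privileged access to the latent state and a student fit by behavior cloning on observations; the environment and all names are hypothetical, not the paper's construction.

```python
import numpy as np

# Toy expert distillation: a teacher acts on the privileged latent state s;
# a student policy over observations o is distilled via behavior cloning.
rng = np.random.default_rng(0)
n_states, n_obs, n_actions = 4, 3, 2

teacher_policy = rng.integers(n_actions, size=n_states)  # privileged teacher
O = rng.dirichlet(np.ones(n_obs), size=n_states)         # emission P(o | s)

# counts[o, a]: how often the teacher chose a when o was observed.
counts = np.zeros((n_obs, n_actions))
for _ in range(10_000):
    s = rng.integers(n_states)          # latent state (seen only in training)
    o = rng.choice(n_obs, p=O[s])       # the student's observation
    counts[o, teacher_policy[s]] += 1

# Student: most frequent teacher action per observation.
student_policy = counts.argmax(axis=1)
print("distilled student policy per observation:", student_policy)
```

The same toy exposes the pitfall the paper formalizes: when latent states that require different actions emit the same observation, the distilled student averages the teacher's labels and can be far from optimal. Roughly, the deterministic filter condition rules this out by ensuring the belief over latent states is pinned down by the interaction history.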

📝 Abstract
Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another useful empirical paradigm of asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes (POMDPs). We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.
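
As a companion to the abstract's second paradigm, below is a minimal sketch of an asymmetric actor-critic update: the critic is fit on the privileged latent state available during training, while the actor conditions only on the observation it will see at execution time. The belief filter that the abstract's oracle learns approximately is, in its exact form, b'(s') ∝ O(o' | s') Σ_s T(s' | s, a) b(s); this sketch omits the belief-weighting and the filter-stability oracle, and the toy environment is an assumption for illustration only.

```python
import numpy as np

# Toy asymmetric actor-critic: critic V over latent states (privileged),
# actor pi(a | o) over observations (available at execution).
rng = np.random.default_rng(1)
n_states, n_obs, n_actions = 4, 3, 2

O = rng.dirichlet(np.ones(n_obs), size=n_states)                   # P(o | s)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # r(s, a)

logits = np.zeros((n_obs, n_actions))  # actor parameters for softmax policy
V = np.zeros(n_states)                 # asymmetric critic on latent states
alpha, beta, gamma = 0.1, 0.1, 0.95

s = rng.integers(n_states)
for _ in range(20_000):
    o = rng.choice(n_obs, p=O[s])
    p = np.exp(logits[o] - logits[o].max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    s_next = rng.choice(n_states, p=T[s, a])
    td = R[s, a] + gamma * V[s_next] - V[s]  # TD error uses privileged state
    V[s] += beta * td                        # critic update
    grad = -p
    grad[a] += 1.0                           # gradient of log pi(a | o)
    logits[o] += alpha * td * grad           # actor update from observations
    s = s_next

print("greedy actions per observation:", logits.argmax(axis=1))
```

The design point this illustrates is the paradigm's practical appeal: fitting the critic on the latent state sidesteps value estimation from long histories, and the paper's contribution is to show when such asymmetry can be made provably efficient.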
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Partial Observability
Privileged Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deterministic Filter Condition
Asymmetric Actor-Critic Algorithm
Privileged Information Multi-Agent Reinforcement Learning