🤖 AI Summary
This work addresses a key challenge in decentralized bilevel reinforcement learning: the leader can observe, but not directly influence, the follower's optimization process. The authors propose a hypergradient-based policy optimization method that explicitly models the follower's optimal response to the leader's decisions by deriving and estimating the hypergradient of the leader's objective with respect to its own policy. The approach relies solely on interaction samples and introduces the Boltzmann covariance trick to enable efficient hypergradient computation in decentralized bilevel Markov games, scaling effectively to high-dimensional leader policy spaces. Empirical evaluations on both discrete- and continuous-state tasks show that the method significantly improves sample efficiency and achieves end-to-end hypergradient updates in fully decentralized settings.
📝 Abstract
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bilevel reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. A fundamental challenge arises when the leader cannot intervene in the follower's optimization process and can only observe its outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient with respect to the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits, or that rely on gradient estimators whose complexity grows substantially with the dimension of the leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for two-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete- and continuous-state tasks.
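To give intuition for the covariance identity underlying a "Boltzmann covariance trick" (this is an illustrative sketch, not the paper's full decentralized estimator): for a Boltzmann (softmax) policy over a finite action set, the gradient of an expected objective with respect to the logits equals a covariance between the objective and the action indicator, so it can be estimated from samples without differentiating through the sampling process. Here `theta` (leader logits) and `f` (per-action objective values) are hypothetical stand-ins; the identity checked is d/dtheta E_pi[f(a)] = pi * (f - E_pi[f]), verified against a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
theta = rng.normal(size=n)   # hypothetical leader "logits"
f = rng.normal(size=n)       # hypothetical per-action objective values

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

pi = softmax(theta)

# Covariance form: grad wrt theta_b of E_pi[f(A)] equals
# Cov_pi(f(A), 1[A=b]) = pi_b * (f_b - E_pi[f]), computed for all b at once.
grad_cov = pi * (f - pi @ f)

# Central finite-difference check of the same gradient.
eps = 1e-6
grad_fd = np.array([
    (softmax(theta + eps * np.eye(n)[b]) @ f
     - softmax(theta - eps * np.eye(n)[b]) @ f) / (2 * eps)
    for b in range(n)
])
assert np.allclose(grad_cov, grad_fd, atol=1e-6)
```

Because the covariance form involves only quantities observable from interaction (objective values and visited actions), a Monte Carlo analogue of `grad_cov` can be built from samples, which is the property that lets such estimators scale with the leader's decision dimension.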