🤖 AI Summary
This work models the Actor-Critic (AC) framework as a bilevel optimization (BLO) problem, revealing its intrinsic Stackelberg game structure: the Critic performs nested updates to learn the optimal response to the Actor's policy, while the Actor updates along a hypergradient that accounts for the Critic's dynamic evolution. A key challenge lies in hypergradient computation, which requires inverse-Hessian-vector products and suffers from numerical instability. To address this, the authors propose the first application of the Nyström method to hypergradient estimation in AC, significantly improving numerical stability; concurrently, they linearize the Critic's objective to reduce computational overhead. They theoretically establish that the algorithm converges in polynomial time, with high probability, to a local strong Stackelberg equilibrium. Empirical evaluation across diverse discrete- and continuous-control benchmarks demonstrates performance on par with or superior to PPO, validating both effectiveness and robustness.
📝 Abstract
The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies the necessary conditions for) a local strong Stackelberg equilibrium in polynomial time with high probability, assuming a linear parametrization of the critic's objective. Empirically, we demonstrate that BLPO performs on par with or better than PPO on a variety of discrete and continuous control tasks.
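The numerically delicate step the abstract describes is applying the inverse of the critic's Hessian to a vector. As a rough illustration of the general idea (not the paper's actual algorithm), the sketch below approximates a regularized inverse-Hessian-vector product `(H + ρI)⁻¹v` using a Nyström low-rank approximation built from a sampled column subset, combined with the Woodbury identity; the function name, the explicit Hessian, and the regularizer `ρ` are all assumptions for illustration — in practice `H` would only be accessed through Hessian-vector products.

```python
import numpy as np

def nystrom_ihvp(H, v, idx, rho=1e-2):
    """Sketch: approximate (H + rho*I)^{-1} v.

    Builds the Nystrom approximation H ~ C W^{-1} C^T from the
    sampled column indices `idx` (C = H[:, idx], W = H[idx, idx]),
    then applies the Woodbury identity so only a small k x k
    system is solved, which is the source of the numerical
    stability the abstract refers to.
    """
    C = H[:, idx]                 # n x k sampled columns
    W = H[np.ix_(idx, idx)]       # k x k intersection block
    # Woodbury: (rho*I + C W^{-1} C^T)^{-1} v
    #         = (v - C (rho*W + C^T C)^{-1} C^T v) / rho
    inner = rho * W + C.T @ C
    return (v - C @ np.linalg.solve(inner, C.T @ v)) / rho
```

With the full index set the Nyström approximation is exact, so the result matches a direct solve of `(H + ρI)x = v`; sampling fewer columns trades accuracy for a much cheaper `k × k` solve.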