🤖 AI Summary
This work addresses the challenge of computing hypergradients in bilevel reinforcement learning (RL) when the lower-level RL problem is nonconvex—rendering standard hypergradient estimation intractable. We propose the first fully first-order hypergradient characterization framework that dispenses with convexity assumptions on the lower-level RL problem, deriving computable hypergradients from a regularized RL fixed-point equation. Based on this, we design both model-based and model-free bilevel RL algorithms, establishing an $O(\varepsilon^{-1})$ convergence rate under mild conditions. Notably, we reveal for the first time that the hypergradient intrinsically unifies exploration and exploitation. Via stochastic optimization analysis, we derive upper bounds on iteration and sample complexity. Empirical results validate the effectiveness of our model-free algorithm in policy optimization and environmental adaptation. The core contribution lies in breaking the convexity dependency, thereby establishing a rigorous bilevel optimization theory and an efficient algorithmic framework for nonconvex lower-level RL.
📝 Abstract
Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed-point equation associated with the regularized RL problem, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of the hyper-gradient from general AID-based bilevel frameworks, since we take advantage of the specific structure of RL problems. Moreover, we design both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms enjoy the convergence rate $O(\epsilon^{-1})$. To extend the applicability, a stochastic version of the model-free algorithm is proposed, along with results on its iteration and sample complexity. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
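To make the fixed-point idea concrete, here is a generic sketch of how implicit differentiation of a fixed-point equation yields a hyper-gradient; the symbols $f$, $F$, $T$, and $\pi^{*}$ are illustrative placeholders and not necessarily the paper's notation:

```latex
% Generic bilevel structure (illustrative only):
%   upper level:  minimize F(x) = f(x, pi*(x)) over x
%   lower level:  pi*(x) solves the regularized RL fixed-point equation
\min_{x}\; F(x) = f\bigl(x, \pi^{*}(x)\bigr),
\qquad
\pi^{*}(x) = T_{x}\bigl(\pi^{*}(x)\bigr).

% Differentiating the fixed-point equation in x (implicit function
% theorem) gives the policy sensitivity, and hence a hyper-gradient
% expression that never invokes lower-level convexity:
\frac{\mathrm{d}\pi^{*}}{\mathrm{d}x}
  = \bigl(I - \partial_{\pi} T_{x}\bigr)^{-1}\,\partial_{x} T_{x},
\qquad
\nabla F(x)
  = \nabla_{x} f
  + \Bigl(\frac{\mathrm{d}\pi^{*}}{\mathrm{d}x}\Bigr)^{\!\top}
    \nabla_{\pi} f .
```

Regularization is what makes this well-posed: it ensures the lower-level fixed point is unique and the map $I - \partial_{\pi} T_{x}$ is invertible, so the hyper-gradient is computable from first-order information alone.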