AI Summary
This work addresses the low sample efficiency of Conditional Value-at-Risk (CVaR) policy gradient methods, which stems from their exclusive reliance on tail trajectories. To overcome this limitation, the authors propose an augmented optimization objective that incorporates an expected quantile term, enabling full utilization of all sampled trajectories through quantile dynamic programming while preserving the original CVaR optimization goal. This approach constitutes the first method to achieve efficient, full-sample CVaR policy optimization within the class of Markov policies. Empirical evaluations across multiple environments with verifiable risk-averse behaviors demonstrate that the proposed method significantly outperforms conventional CVaR-PG and other state-of-the-art risk-sensitive reinforcement learning algorithms in terms of both sample efficiency and performance.
Abstract
Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) suffers from significant sample inefficiency. This inefficiency stems from the fact that CVaR-PG focuses on tail-end performance and discards most sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective, since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm, within the Markovian policy class, substantially improves upon CVaR-PG and consistently outperforms other existing methods.
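The claim that "CVaR corresponds to the expectation of the quantile over the tail" is the identity CVaR_α(Z) = (1/α) ∫₀^α VaR_u(Z) du. A minimal NumPy sketch (synthetic returns and all variable names are illustrative, not from the paper) checks this numerically by comparing the mean of the worst α-fraction of samples against the average of the quantile function over levels in (0, α]:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=100_000)  # hypothetical trajectory returns
alpha = 0.1                         # tail level

# CVaR estimate 1: mean of the worst alpha-fraction of returns.
k = int(alpha * len(returns))
cvar_tail_mean = np.sort(returns)[:k].mean()

# CVaR estimate 2: average of the quantile (VaR) over levels u in (0, alpha],
# i.e. a midpoint-rule approximation of (1/alpha) * integral of VaR_u du.
levels = (np.arange(k) + 0.5) / len(returns)
cvar_quantile_avg = np.quantile(returns, levels).mean()

# The two estimates agree up to discretization error.
assert abs(cvar_tail_mean - cvar_quantile_avg) < 1e-2
print(cvar_tail_mean, cvar_quantile_avg)
```

Because the quantile-averaging form touches every level u up to α rather than only realized tail outcomes, it hints at why a quantile-based objective can exploit all trajectories while leaving the CVaR target unchanged.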