Towards Scalable General Utility Reinforcement Learning: Occupancy Approximation, Sample Complexity and Global Optimality

📅 2024-10-05
📈 Citations: 1
Influential: 0
🤖 AI Summary
In general utility reinforcement learning (GURL) with large state-action spaces, occupancy measure estimation poses a scalability bottleneck. Method: This paper introduces the first integration of maximum-likelihood occupancy measure estimation under function approximation into the policy gradient framework. Contribution/Results: We theoretically establish that the estimation error depends only on the complexity of the function class (e.g., Rademacher complexity), independent of the state-action space size. For non-concave and concave utility functions, we derive first-order stationarity and global optimality guarantees, respectively, for the learned policy. Empirically, our method significantly improves training efficiency and generalization performance, outperforming tabular counting-based approaches on high-dimensional tasks. These results validate the theoretical advantage that sample complexity scales with function-class complexity rather than state-action space size, thereby enabling scalable GURL in continuous or large discrete domains.

📝 Abstract
Reinforcement learning with general utilities has recently gained attention thanks to its ability to unify several problems, including imitation learning, pure exploration, and safe reinforcement learning. However, prior work for solving this general problem in a unified way has only focused on the tabular setting. This is restrictive when considering larger state-action spaces because of the need to estimate occupancy measures during policy optimization. In this work, we address this issue and propose to approximate occupancy measures within a function approximation class using maximum likelihood estimation (MLE). We propose a simple policy gradient algorithm where an actor updates the policy parameters to maximize the general utility objective while a critic approximates the occupancy measure using MLE. We provide a statistical complexity analysis showing that our occupancy measure estimation error scales only with the dimension of our function approximation class rather than the size of the state-action space. Under suitable assumptions, we establish first-order stationarity and global optimality performance bounds for the proposed algorithm for nonconcave and concave general utilities, respectively. We complement our methodological and theoretical findings with promising empirical results showing the scalability potential of our approach compared to existing tabular count-based approaches.
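The core estimation step described above (the "critic" fitting an occupancy measure by MLE within a parametric class) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a softmax density model over a fixed feature map `phi(s, a)`, fit by gradient ascent on the log-likelihood of sampled state-action pairs; the function names and the feature construction are hypothetical.

```python
import numpy as np

def fit_occupancy_mle(sample_idx, features, n_iters=500, lr=0.5):
    """Fit a softmax occupancy model p_theta(s,a) proportional to
    exp(theta . phi(s,a)) by maximum likelihood via gradient ascent.

    sample_idx : int array of sampled (s,a) indices from policy rollouts
    features   : (n_pairs, d) feature matrix; d is the model dimension,
                 which can be far smaller than the number of (s,a) pairs
    Returns the fitted distribution over (s,a) pairs and the parameters.
    """
    d = features.shape[1]
    theta = np.zeros(d)
    counts = np.bincount(sample_idx, minlength=features.shape[0])
    emp = counts / counts.sum()          # empirical visitation frequencies
    for _ in range(n_iters):
        logits = features @ theta
        p = np.exp(logits - logits.max())  # numerically stable softmax
        p /= p.sum()
        # Gradient of the average log-likelihood: E_emp[phi] - E_theta[phi]
        grad = emp @ features - p @ features
        theta += lr * grad
    logits = features @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum(), theta

# Toy usage: 6 (s,a) pairs with one-hot features (the tabular special case),
# samples drawn from a skewed "true" occupancy.
rng = np.random.default_rng(0)
n_pairs = 6
features = np.eye(n_pairs)
true_occ = np.array([0.4, 0.2, 0.15, 0.1, 0.1, 0.05])
samples = rng.choice(n_pairs, size=2000, p=true_occ)
p_hat, _ = fit_occupancy_mle(samples, features)
```

With one-hot features the MLE recovers the empirical frequencies exactly, so this degenerates to the tabular count-based estimator; the point of the paper's analysis is that with a low-dimensional feature map the estimation error is governed by `d` (more generally, the complexity of the function class) rather than by the number of state-action pairs.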
Problem

Research questions and friction points this paper is trying to address.

Scalable reinforcement learning with general utilities
Approximating occupancy measures using function classes
Overcoming limitations of tabular methods in large spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximates occupancy measures via MLE
Uses actor-critic policy gradient algorithm
Error scales with function-class dimension, not state-action space size