🤖 AI Summary
This work addresses the challenge of maximizing nonlinear concave aggregate rewards in multi-objective reinforcement learning (MORL), introducing general nonlinear concave combination objectives into the MORL framework for the first time. We propose a model-free policy gradient algorithm featuring a biased yet convergent gradient estimator, and rigorously establish its sample complexity for reaching an ε-optimal policy as O(M⁴σ²/((1−γ)⁸ε⁴)), where the ε-dependence matches that of single-objective policy gradient methods. Our approach overcomes fundamental limitations of conventional linear scalarization and Pareto-frontier optimization in MORL, ensuring both strong modeling expressivity (particularly for naturally concave structures) and provable convergence. It provides a novel paradigm for long-horizon cooperative optimization problems in engineering applications such as resource allocation and energy-efficiency balancing.
📝 Abstract
Many engineering problems involve multiple objectives, where the overall aim is to optimize a non-linear function of these objectives. In this paper, we formulate the problem of maximizing a non-linear concave function of multiple long-term objectives, and propose a policy-gradient-based model-free algorithm for it. To estimate the gradient, a biased estimator is proposed. The proposed algorithm is shown to converge to within $\epsilon$ of the global optimum after sampling $\mathcal{O}\!\left(\frac{M^4\sigma^2}{(1-\gamma)^8\epsilon^4}\right)$ trajectories, where $\gamma$ is the discount factor and $M$ is the number of agents, thus achieving the same dependence on $\epsilon$ as the policy gradient algorithm for standard single-objective reinforcement learning.
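To make the setting concrete, here is a minimal sketch of the core idea: maximize a concave function $f$ of $M$ long-term discounted returns by chaining $\nabla f$ through per-objective REINFORCE gradient estimates. Everything below is illustrative and not from the paper: the toy MDP, the choice $f(J) = \sum_m \log J_m$ as an example concave scalarization, and the plain score-function estimator. Note how plugging sample-mean return estimates into the nonlinear $\nabla f$ yields a *biased* gradient estimate, which is exactly the kind of estimator the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP with M = 2 reward objectives (purely illustrative).
n_states, n_actions, M = 2, 2, 2
gamma, horizon = 0.9, 30
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])                 # P[s, a, s']
R = rng.uniform(0.1, 1.0, size=(M, n_states, n_actions))  # one reward function per objective

def softmax_policy(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory(theta):
    """Return the M truncated discounted returns and the summed score function."""
    s, returns, score = 0, np.zeros(M), np.zeros_like(theta)
    for t in range(horizon):
        pi = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=pi)
        score[s] += np.eye(n_actions)[a] - pi   # accumulate grad log pi(a|s)
        returns += (gamma ** t) * R[:, s, a]
        s = rng.choice(n_states, p=P[s, a])
    return returns, score

def estimate_gradient(theta, n_traj=32):
    # Sample-mean estimates of J_m and (REINFORCE-style) grad J_m, then chain rule
    # through f(J) = sum_m log J_m. Because f is nonlinear, plugging in the sample
    # mean of J makes this estimator biased (though the bias shrinks with batch size).
    J = np.zeros(M)
    gradJ = np.zeros((M,) + theta.shape)
    for _ in range(n_traj):
        G, score = sample_trajectory(theta)
        J += G / n_traj
        gradJ += G[:, None, None] * score / n_traj
    df = 1.0 / J                                # partial f / partial J_m for f = sum log J_m
    return np.tensordot(df, gradJ, axes=1), J

theta = np.zeros((n_states, n_actions))         # tabular softmax policy parameters
for step in range(50):
    g, J = estimate_gradient(theta)
    theta += 0.5 * g                            # plain gradient ascent step
```

The log-sum scalarization encodes a fairness-like trade-off (no objective's return can be driven to zero without unbounded penalty), which is one of the naturally concave structures that linear scalarization cannot express.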