🤖 AI Summary
Existing methods for non-convex multi-objective reinforcement learning (MORL) suffer from difficulty in exploring Pareto-stationary policies and lack finite-time theoretical guarantees.
Method: We propose MOCHA, the first algorithm to deeply integrate weighted Chebyshev scalarization with the Actor-Critic framework, incorporating dynamic weight adaptation and gradient-based policy updates to systematically explore the Pareto-stationary policy set.
Contributions/Results: Theoretically, we establish the first finite-time sample complexity analysis for Pareto stationarity in non-convex MORL, proving a convergence rate of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ that explicitly depends on the minimum component $p_{\min}$ of the weight vector. Empirically, MOCHA achieves significant improvements over state-of-the-art MORL baselines on the large-scale offline KuaiRand dataset, demonstrating both theoretical rigor and practical effectiveness.
📝 Abstract
In many multi-objective reinforcement learning (MORL) applications, being able to systematically explore the Pareto-stationary solutions under multiple non-convex reward objectives with a theoretical finite-time sample complexity guarantee is an important and yet under-explored problem. This motivates us to take the first step and fill this important gap in MORL. Specifically, in this paper, we propose a \underline{M}ulti-\underline{O}bjective weighted-\underline{CH}ebyshev \underline{A}ctor-critic (MOCHA) algorithm for MORL, which judiciously integrates the weighted-Chebyshev (WC) scalarization and the actor-critic framework to enable systematic Pareto-stationarity exploration with a finite-time sample complexity guarantee. The sample complexity result of the MOCHA algorithm reveals an interesting dependency on $p_{\min}$ in finding an $\varepsilon$-Pareto-stationary solution, where $p_{\min}$ denotes the minimum entry of a given weight vector $\mathbf{p}$ in the WC-scalarization. By carefully choosing learning rates, the sample complexity for each exploration can be $\tilde{\mathcal{O}}(\varepsilon^{-2})$. Furthermore, simulation studies on a large KuaiRand offline dataset show that the MOCHA algorithm significantly outperforms other baseline MORL approaches.
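To make the core scalarization concrete, here is a minimal sketch of a weighted-Chebyshev (WC) scalarization. This is an illustrative toy, not MOCHA itself: the ideal point `z_star`, the example return vectors, and the function name are all assumptions for demonstration. The WC value of a policy is its worst weighted gap to the ideal point, so minimizing it pushes every objective toward the ideal simultaneously, and sweeping the weight vector `p` traces out different Pareto-stationary trade-offs.

```python
def wc_scalarize(returns, weights, z_star):
    """Weighted-Chebyshev scalarization: the largest weighted gap between
    a policy's per-objective returns and the ideal point z_star.
    Smaller is better; minimizing this balances all objectives at once."""
    assert len(returns) == len(weights) == len(z_star)
    return max(p * (z - j) for p, z, j in zip(weights, z_star, returns))


# Two hypothetical policies' per-objective returns (e.g., two reward signals):
policy_a = [0.8, 0.3]   # strong on objective 1, weak on objective 2
policy_b = [0.6, 0.6]   # balanced across both objectives

z_star = [1.0, 1.0]     # assumed ideal point (best achievable per objective)
p = [0.5, 0.5]          # weight vector; here p_min = 0.5

# Policy A's worst weighted gap: max(0.5*0.2, 0.5*0.7) = 0.35
# Policy B's worst weighted gap: max(0.5*0.4, 0.5*0.4) = 0.20
# Under WC scalarization the balanced policy B is preferred.
print(wc_scalarize(policy_a, p, z_star))  # 0.35
print(wc_scalarize(policy_b, p, z_star))  # 0.2
```

Unlike a linear scalarization $\sum_i p_i J_i$, the max over weighted gaps lets WC reach Pareto points in non-convex regions of the objective space, which is why the abstract emphasizes the non-convex setting and the dependence on $p_{\min}$.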