🤖 AI Summary
This work addresses trajectory-level nonlinear preference optimization in multi-objective reinforcement learning—specifically, maximizing the expected scalarized return (ESR) under a smooth, nonlinear aggregation function over cumulative rewards in multi-objective MDPs. To overcome the limitation of linear scalarization in capturing time-coupled optimality, we introduce the first extended Bellman optimality principle for nonlinear scalarization. Based on this principle, we propose the first falsifiable, pseudo-polynomial-time approximation algorithm capable of computing nonstationary policies under smooth scalarizers and fixed-dimensional reward vectors. We establish a bounded approximation ratio for the algorithm. Empirical evaluation across multiple benchmark tasks demonstrates that our ESR-based approach improves performance by 37%–62% over linear-weighted baselines, significantly enhancing the expressiveness and fidelity of nonlinear preference modeling.
📝 Abstract
We study multi-objective reinforcement learning with nonlinear preferences over trajectories. That is, we maximize the expected value of a nonlinear function over accumulated rewards (expected scalarized return or ESR) in a multi-objective Markov Decision Process (MOMDP). We derive an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. Using this formulation, we describe an approximation algorithm for computing an approximately optimal non-stationary policy in pseudopolynomial time for smooth scalarization functions with a constant number of rewards. We prove the approximation analytically and demonstrate the algorithm experimentally, showing that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.