🤖 AI Summary
Traditional multi-objective reinforcement learning (MORL) methods that rely on large-population evolutionary search suffer from low sample efficiency and high environment-interaction overhead. To address this, we propose a population-free Pareto front tracking framework that supports both online and offline learning. Our approach operates in four stages: Pareto-vertex policy initialization, continuous front tracking, dynamic weight adjustment for sparse regions, and policy aggregation, thereby eliminating dependence on large-scale population evolution. The proposed Pareto tracking mechanism, coupled with adaptive weight sampling in sparse regions, significantly improves front coverage and sample efficiency. Evaluated on seven continuous-control benchmarks, our method achieves superior hypervolume performance compared to state-of-the-art approaches, while requiring fewer environment interactions and lower hardware overhead.
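For readers unfamiliar with the hypervolume metric mentioned above, the following is a minimal sketch of how it can be computed for two objectives. It assumes both objectives are maximized and that a reference point is chosen by hand. The names `pareto_front` and `hypervolume_2d` are hypothetical helpers written for this illustration, not the authors' evaluation code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points (both objectives maximized),
    sorted by increasing first objective."""
    front: List[Point] = []
    best_second = float("-inf")
    # Scan in decreasing order of objective 1. A point is kept only if its
    # objective 2 strictly beats every point already seen, i.e. every point
    # that is at least as good on objective 1.
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        if p[1] > best_second:
            front.append(p)
            best_second = p[1]
    return sorted(front)


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    area = 0.0
    prev_x = ref[0]
    # With objective 1 ascending, objective 2 is descending along the front,
    # so each point contributes one vertical strip of the staircase.
    for x, y in front:
        area += (x - prev_x) * (y - ref[1])
        prev_x = x
    return area


returns = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
print(pareto_front(returns))
print(hypervolume_2d(returns, ref=(0.0, 0.0)))
```

A larger hypervolume means the approximated front dominates more of the objective space relative to the reference point, which is why it is commonly used as a single-number summary of front quality.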
📝 Abstract
Multi-objective reinforcement learning (MORL) plays a pivotal role in real-world multi-criteria decision-making problems. Multi-policy (MP) methods are widely used to obtain high-quality approximations of the Pareto front for MORL problems. However, traditional MP methods rely solely on online reinforcement learning (RL) and adopt an evolutionary framework with a large policy population, which can lead to sample inefficiency and/or an overwhelming number of agent-environment interactions in practice. Abandoning the evolutionary framework, we propose the Multi-policy Pareto Front Tracking (MPFT) framework, which maintains no policy population and supports both online and offline MORL algorithms. The MPFT framework has four stages. Stage 1 approximates all the Pareto-vertex policies, whose images in the objective space fall on the vertices of the Pareto front. Stage 2 introduces a new Pareto tracking mechanism that tracks the Pareto front starting from each Pareto-vertex policy. Stage 3 identifies the sparse regions of the tracked Pareto front and introduces a new objective-weight adjustment method to fill them. Finally, Stage 4 approximates the Pareto front by combining all the policies tracked in Stages 2 and 3. Experiments on seven continuous-action robotic control tasks with both online and offline MORL algorithms demonstrate that MPFT achieves superior hypervolume performance over state-of-the-art benchmarks, with significantly fewer agent-environment interactions and lower hardware requirements.
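The abstract does not specify how Stage 3 detects sparse regions or adjusts the objective weights. As a rough illustration of the general idea only, the sketch below uses a classic stand-in heuristic for two maximized objectives: find the widest gap between adjacent front points, then choose the linear-scalarization weight orthogonal to the segment joining them. The functions `largest_gap` and `gap_weight` are hypothetical names, and this is not claimed to be the MPFT weight adjustment method.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def largest_gap(front: List[Point]) -> Tuple[Point, Point]:
    """Return the pair of adjacent front points with the largest
    Euclidean distance between them (the sparsest region)."""
    ordered = sorted(front)
    if len(ordered) < 2:
        raise ValueError("need at least two front points to define a gap")
    return max(zip(ordered, ordered[1:]),
               key=lambda pair: math.dist(pair[0], pair[1]))


def gap_weight(a: Point, b: Point) -> Tuple[float, float]:
    """Weight vector orthogonal to the segment a-b, normalized to sum to 1.
    Under w1*f1 + w2*f2, both endpoints score equally, so a policy trained
    with this weight is pushed toward the region between them."""
    w1 = abs(a[1] - b[1])
    w2 = abs(a[0] - b[0])
    total = w1 + w2
    if total == 0:
        raise ValueError("endpoints coincide; no gap to fill")
    return (w1 / total, w2 / total)


# Assumed example front: (objective 1, objective 2) returns of tracked policies.
front = [(0.0, 10.0), (1.0, 9.0), (5.0, 2.0), (6.0, 0.0)]
a, b = largest_gap(front)
print(a, b)
print(gap_weight(a, b))
```

In a full pipeline, a weight obtained this way would be handed to a single-policy MORL learner to produce a new policy inside the gap; that training step is omitted here.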