🤖 AI Summary
This work addresses a common failure in multi-objective reinforcement learning (MORL): single-policy approaches often cannot fully recover the Pareto front due to gradient interference and policy representation collapse. To overcome this, the authors propose the D³PO framework, which decouples the optimization of individual objectives to preserve distinct learning signals, delays preference fusion until after those signals stabilize, and introduces a scaled diversity regularizer to keep policy behavior sensitive to preferences. D³PO is presented as the first method to systematically identify and mitigate gradient interference and representation collapse in preference-conditioned policies. Across multiple MORL benchmarks it achieves state-of-the-art or comparable performance with only a single deployable policy, significantly improving both the coverage and quality of the recovered Pareto front as measured by hypervolume and expected utility.
📝 Abstract
Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
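To make the "decomposed optimization with delayed preference fusion" idea concrete, here is a minimal NumPy sketch of one plausible reading: per-objective advantages are estimated and normalized independently (preserving each objective's learning signal), and the preference weights are applied only afterwards, at loss time, rather than scalarizing the rewards up front. The function names `per_objective_gae` and `fuse_advantages` are hypothetical, and the paper's actual fusion schedule and diversity regularizer are not reproduced here.

```python
import numpy as np

def per_objective_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation run independently per objective.

    rewards: (T, K) per-step rewards for K objectives.
    values:  (T+1, K) per-objective value estimates (bootstrap value in last row).
    Returns a (T, K) advantage array, one column per objective.
    """
    T, K = rewards.shape
    adv = np.zeros((T, K))
    last = np.zeros(K)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def fuse_advantages(adv, w):
    """Delayed preference fusion: normalize each objective's advantages
    separately, so no objective's gradient signal is drowned out by scale
    differences, then scalarize with preference weights w only at the end."""
    adv = (adv - adv.mean(axis=0)) / (adv.std(axis=0) + 1e-8)
    return adv @ w
```

The contrast with premature scalarization is the order of operations: scalarizing rewards first (`rewards @ w`) mixes objectives before advantage estimation, so conflicting objectives can cancel each other's gradients, whereas fusing normalized per-objective advantages keeps each learning signal intact until credit assignment is done.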