Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-objective robotic reinforcement learning, scalar reward aggregation suffers from hyperparameter sensitivity, suboptimal convergence, and poor scalability. To address this, we propose Gradient Conflict-Resolved PPO (GCR-PPO), which introduces a multi-head critic to decompose gradients at the objective level and integrates a priority-aware gradient coordination mechanism to explicitly model and resolve inter-objective gradient conflicts. Additionally, GCR-PPO incorporates vector-valued reward modeling and policy regularization. Evaluated on the IsaacLab benchmark, GCR-PPO achieves a statistically significant 9.5% average performance improvement over parallel PPO (p = 0.04), with notably larger gains in high-conflict tasks—without incurring additional computational overhead. This work is the first to systematically embed gradient conflict analysis and priority-based coordination into the PPO framework, substantially enhancing stability, scalability, and optimization efficiency of multi-objective RL for robotic control.

Technology Category

Application Category

📝 Abstract
Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we investigate the conflict between gradient contributions for each objective that emerge from scalarising the task objectives. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients using a multi-headed critic and resolves conflicts based on the objective priority. Our methodology, GCR-PPO, is evaluated on the well-known IsaacLab manipulation and locomotion benchmarks and additional multi-objective modifications on two related tasks. We show superior scalability compared to parallel PPO (p = 0.04), without significant computational overhead. We also show higher performance with more conflicting tasks. GCR-PPO improves on large-scale PPO with an average improvement of 9.5%, with high-conflict tasks observing a greater improvement. The code is available at https://github.com/humphreymunn/GCR-PPO.
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient conflicts in multi-objective robot reinforcement learning
Resolves reward scalarization issues in policy optimization for robotics
Improves scalability and performance in conflicting task objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-headed critic for objective-wise gradients
Resolves gradient conflicts by objective priority
Modifies actor-critic optimization for multi-objective RL
🔎 Similar Papers
No similar papers found.
H
Humphrey Munn
School of Electrical Engineering and Computer Science, University of Queensland, QLD. 4072, Australia
B
Brendan Tidd
Data61, CSIRO
P
Peter Böhm
School of Electrical Engineering and Computer Science, University of Queensland, QLD. 4072, Australia
Marcus Gallagher
Marcus Gallagher
University of Queensland
Computer scienceartificial intelligenceevolutionary computationmachine learningoptimization
D
David Howard
Data61, CSIRO