Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction

πŸ“… 2024-11-01
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Hierarchical reinforcement learning (HRL) suffers from two key challenges: non-stationarity in high-level learning caused by the evolving low-level policy, and infeasible subgoals generated by the high-level policy that the low-level policy cannot execute. To address these, the authors propose Hierarchical Preference Optimization (HPO), a framework that integrates token-level direct preference optimization (DPO) into HRL without requiring a pretrained reference policy. HPO jointly optimizes high-level subgoal generation and low-level action selection via a bi-level optimization formulation, and introduces a primitive-regularized DPO loss that enforces subgoal feasibility and prevents degenerate solutions. Maximum entropy regularization is incorporated to improve exploration. Evaluated on robotic navigation and manipulation tasks, HPO achieves performance gains of up to 35% over strong baselines, mitigating both non-stationarity and subgoal infeasibility. Ablation studies and quantitative analysis validate its effectiveness.
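The summary's central object, a reference-free DPO-style loss with a primitive-feasibility regularizer, can be sketched as follows. This is a minimal illustration of the general idea, not the paper's actual equations: the function name, the exact form of the regularizer, and the way the low-level policy's log-likelihood enters are all assumptions.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hpo_style_loss(logp_preferred: float,
                   logp_rejected: float,
                   logp_primitive: float,
                   beta: float = 0.1,
                   lam: float = 0.5) -> float:
    """Illustrative reference-free DPO-style preference loss with a
    primitive-feasibility regularizer (sketch, not the paper's loss).

    logp_preferred / logp_rejected: high-level policy log-probabilities of
        the preferred and rejected subgoal sequences (summed over tokens).
    logp_primitive: low-level (primitive) policy log-likelihood of reaching
        the preferred subgoal; very negative values indicate an
        infeasible subgoal.
    """
    # Reference-free preference margin: no pretrained reference policy
    # appears, matching the summary's "without a reference policy" claim.
    margin = beta * (logp_preferred - logp_rejected)
    preference_loss = -math.log(sigmoid(margin))
    # Primitive regularization: subgoals the low-level policy is unlikely
    # to execute are penalized, discouraging infeasible subgoal generation.
    feasibility_penalty = -lam * logp_primitive
    return preference_loss + feasibility_penalty
```

The two terms capture the paper's two stated goals separately: the sigmoid term rewards ranking preferred subgoal sequences above rejected ones, while the penalty term ties the high-level policy's choices to what the low-level policy can actually reach.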

πŸ“ Abstract
This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) that addresses non-stationarity and infeasible subgoal generation when solving complex robotic control tasks. HPO leverages maximum entropy reinforcement learning combined with token-level Direct Preference Optimization (DPO), eliminating the need for pre-trained reference policies that are typically unavailable in challenging robotic scenarios. Mathematically, the authors formulate HRL as a bi-level optimization problem and transform it into a primitive-regularized DPO formulation, ensuring feasible subgoal generation and avoiding degenerate solutions. Extensive experiments on challenging robotic navigation and manipulation tasks demonstrate strong performance, with improvements of up to 35% over the baselines. Furthermore, ablation studies validate the design choices, and quantitative analyses confirm the ability of HPO to mitigate non-stationarity and infeasible subgoal generation in HRL.
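The bi-level structure mentioned in the abstract can be written schematically as follows. The notation here is illustrative, not the paper's: a high-level policy $\pi^{\mathrm{hi}}$ proposes subgoals $g$, while the low-level policy is constrained to be optimal for those subgoals, which is what couples the two levels and makes the high-level problem non-stationary during joint training.

```latex
\max_{\pi^{\mathrm{hi}}} \; J^{\mathrm{hi}}\!\left(\pi^{\mathrm{hi}}, \pi^{\mathrm{lo}*}\right)
\quad \text{s.t.} \quad
\pi^{\mathrm{lo}*} \in \arg\max_{\pi^{\mathrm{lo}}} \;
J^{\mathrm{lo}}\!\left(\pi^{\mathrm{lo}};\, g \sim \pi^{\mathrm{hi}}\right),
```

with an entropy bonus $\mathcal{H}(\pi)$ added to each objective under the maximum entropy formulation the abstract refers to. The paper's contribution, as summarized above, is to transform this nested problem into a single primitive-regularized DPO objective.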
Problem

Research questions and friction points this paper is trying to address.

Mitigates non-stationarity in hierarchical reinforcement learning
Addresses infeasible subgoal generation in policy decomposition
Decouples higher-level learning from lower-level reward signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

DPO optimizes higher-level policy with preferences
Bi-level formulation decouples non-stationary reward signals
Regularization ensures subgoal feasibility for lower policies
πŸ”Ž Similar Papers
No similar papers found.