AI Summary
Hierarchical reinforcement learning (HRL) from static offline data is hindered by unobservable high-level actions and heterogeneous policy architectures, leading to poor trainability. Method: We propose OHIO, the first framework to integrate inverse optimization into offline hierarchical RL, inferring latent high-level actions from observed trajectories without supervision to construct hierarchy-aligned training data. Grounded in hierarchical MDP modeling, OHIO is compatible with standard offline RL algorithms (e.g., BCQ, CQL) and enables cross-architecture data transfer. Results: Evaluated on robotic manipulation and network resource scheduling tasks, OHIO substantially outperforms end-to-end offline RL baselines, improving policy robustness by 37%. It supports offline pretraining followed by online fine-tuning, eliminating the conventional requirement for observable high-level actions in offline HRL.
Abstract
Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.
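The core idea of recovering unobservable high-level actions can be illustrated with a minimal sketch. Here we assume a hypothetical setting (not taken from the paper) in which the low-level policy is a known proportional controller that tracks a goal chosen by the high-level policy; because the controller is invertible in closed form, the latent goal can be recovered from each logged state-action pair and used to relabel the offline dataset:

```python
import numpy as np

# Hypothetical setup (illustrative only): the low-level policy is a known
# proportional controller a = K @ (g - s) that drives the state s toward a
# goal g chosen by the high-level policy. The goal g is never logged.

K = np.diag([2.0, 0.5])  # assumed known low-level controller gain


def low_level(state, goal):
    """Low-level controller: action that moves the state toward the goal."""
    return K @ (goal - state)


def recover_goal(state, action):
    """Inverse problem: the latent high-level action (goal) that would have
    produced the observed low-level action under this controller."""
    return state + np.linalg.solve(K, action)


# A logged transition whose high-level action was not recorded.
state = np.array([1.0, -1.0])
true_goal = np.array([0.5, 2.0])
action = low_level(state, true_goal)

# Recover the unobservable high-level action from the offline data.
goal_hat = recover_goal(state, action)
assert np.allclose(goal_hat, true_goal)

# The relabeled tuple (state, goal_hat, reward, next_state) can then be fed
# to an off-the-shelf offline RL algorithm to train the high-level policy.
```

In practice the inverse problem is generally posed as an optimization (e.g., finding the high-level action whose induced low-level behavior best matches the observed trajectory) rather than a closed-form inversion; the sketch only conveys the relabeling principle.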