Generalizable Hierarchical Skill Learning via Object-Centric Representation

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

career value

182K/year

🤖 AI Summary

To address poor policy generalization and low sample efficiency in robotic manipulation, this paper proposes a generalizable hierarchical skill learning framework grounded in object-centric representations. The method integrates vision-language models (VLMs) with visuomotor policies, using object-centric skills as a unified interface between high-level semantics and low-level control. VLM guidance enables skill segmentation, canonical action mapping, and hierarchical policy learning—facilitating efficient skill transfer across varying scene layouts, object appearances, and compositional tasks. In simulation, the approach achieves a 15.5% performance gain over a baseline trained on 30× more demonstration data (i.e., only three demonstrations per task). In real-world experiments, it significantly outperforms state-of-the-art methods trained on 10× more data. The core contribution is a VLM-driven object-centric skill abstraction mechanism, which—uniquely—enables concurrent improvement of few-shot generalization across multiple dimensions: layout, appearance, and task composition.

Technology Category

Application Category

📝 Abstract

We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.

Problem

Research questions and friction points this paper is trying to address.

Improves robot policy generalization and sample efficiency

Bridges vision-language models with visual-motor policies

Enables adaptation to unseen spatial arrangements and objects

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric skills bridge vision-language models and motor policies

Decomposes demonstrations into transferable canonical skill primitives

Maps inferred canonical actions back to world frame for execution

🔎 Similar Papers

No similar papers found.

Toyota Research Institute

Los Altos, CA / Cambridge, MA

Authors to Follow