🤖 AI Summary
This work addresses a limitation of existing skill-centric approaches, which rely on fixed skill repositories and struggle to adapt to new tasks without human intervention. To overcome this, we propose Uni-Skill, a framework that integrates skill-aware planning with an automatic skill-evolution mechanism, proactively acquiring new skills when current capabilities are insufficient. Leveraging an offline-constructed SkillFolder repository, Uni-Skill enables efficient skill retrieval and introduces, for the first time, a self-evolving skill library coupled with a VerbNet-inspired hierarchical skill ontology. This shifts skill acquisition from manual annotation to automated, structured extraction from large-scale unlabeled robotic videos, thereby supporting zero-shot generalization. Experiments in both simulated and real-world environments demonstrate that Uni-Skill significantly improves cross-task zero-shot generalization and complex-task reasoning.
📝 Abstract
While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests new skill implementations when existing ones are insufficient, ensuring adaptable planning with a self-augmented skill library. To support automatic implementation of the diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.
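To make the retrieval idea concrete, the sketch below illustrates how a VerbNet-style hierarchical skill taxonomy could support fallback retrieval: a query for an unseen fine-grained skill falls back to its nearest ancestor class and returns that class's demonstrations. This is a minimal illustration, not the paper's actual implementation; the node schema, skill labels, and demo identifiers are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    """One node in a VerbNet-style skill hierarchy (hypothetical schema).

    Inner nodes are abstract verb classes; deeper nodes are more
    specific skill descriptions holding demonstration ids."""
    name: str
    children: dict = field(default_factory=dict)
    demos: list = field(default_factory=list)

    def add_path(self, path, demo_id):
        # Insert a demonstration under a path of increasingly
        # specific labels, e.g. ["manipulate", "grasp", "grasp-handle"].
        node = self
        for label in path:
            node = node.children.setdefault(label, SkillNode(label))
        node.demos.append(demo_id)

    def retrieve(self, path):
        # Walk as deep as the query path allows; when a label is
        # unseen, fall back to the deepest matched ancestor and
        # return every demonstration beneath it.
        node, matched = self, self
        for label in path:
            node = node.children.get(label)
            if node is None:
                break
            matched = node
        return matched.collect()

    def collect(self):
        demos = list(self.demos)
        for child in self.children.values():
            demos.extend(child.collect())
        return demos

folder = SkillNode("root")
folder.add_path(["manipulate", "grasp", "grasp-handle"], "demo_001")
folder.add_path(["manipulate", "grasp", "grasp-rim"], "demo_002")
folder.add_path(["manipulate", "push"], "demo_003")

# An exact query returns that leaf's demos; an unseen leaf
# ("grasp-lid") falls back to the "grasp" class.
print(folder.retrieve(["manipulate", "grasp", "grasp-handle"]))   # ['demo_001']
print(sorted(folder.retrieve(["manipulate", "grasp", "grasp-lid"])))  # ['demo_001', 'demo_002']
```

The fallback step mirrors the role of multiple abstraction levels in the taxonomy: when no demonstration matches a novel skill description exactly, coarser verb classes still supply semantically related references for few-shot inference.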