RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

📅 2024-03-28
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Embodied agents suffer from a lack of composable, primitive-level action representations and of large-scale, real-world manipulation data. Method: This paper introduces RH20T-P, the first primitive-level, real-world robotic manipulation dataset, comprising about 38K video clips across 67 tasks, all manually annotated with composable operational primitives. It defines a standardized taxonomy of robot manipulation primitives and proposes a plan-execute Composable Generalization Agent (CGA) paradigm, leveraging Vision-Language Models for task decomposition alongside a composable-planner baseline (RA-P). Contribution/Results: Experiments show clear improvements in zero-shot transfer to out-of-distribution tasks, supporting the role of primitive-level composability in generalization. RH20T-P thus provides both a data foundation and a methodological framework for scalable, composable embodied intelligence.

📝 Abstract
Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks by decomposing compound tasks into a plan that sequentially executes primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset, which contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P, whose positive performance on solving unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents.
Problem

Research questions and friction points this paper is trying to address.

robotics
action primitives
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

RH20T-P Dataset
Adaptive Robot Learning
Unknown Task Adaptation
Zeren Chen
Shanghai AI Laboratory; School of Software, Beihang University
Zhelun Shi
Shanghai AI Laboratory; School of Software, Beihang University
Xiaoya Lu
University of Electronic Science and Technology of China
Lehan He
Nanjing University of Posts and Telecommunications
Computer Science · Artificial Intelligence
Sucheng Qian
Shanghai AI Laboratory; Shanghai Jiao Tong University
Hao-Shu Fang
Shanghai Jiao Tong University
Zhenfei Yin
Shanghai AI Laboratory; University of Sydney
Wanli Ouyang
Shanghai AI Laboratory; University of Sydney
Jing Shao
Research Scientist, Shanghai AI Laboratory / Shanghai Jiao Tong University
Computer Vision · Multi-Modal Large Language Model
Yu Qiao
Shanghai AI Laboratory
Cewu Lu
Shanghai Jiao Tong University
Lu Sheng
School of Software, Beihang University
Embodied AI · 3D Vision · Machine Learning