PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

📅 2025-05-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current fine-grained robotic manipulation research is hindered by the absence of large-scale 3D object datasets with part-level annotations and corresponding instructional tasks. To address this, we introduce PartInstruct, the first large-scale benchmark tailored to this challenge, comprising 513 part-segmented 3D object instances and 1,302 part-level manipulation tasks across 16 task classes, enabling generalization evaluation across objects, states, and tasks. We propose a part-instruction-driven evaluation paradigm, formally defining three core capabilities: part perception, spatial grounding, and long-horizon control. Leveraging 3D simulation, we generate over 10,000 expert demonstrations with precise part poses and skill chains. Comprehensive evaluation of end-to-end vision-language policies and bi-level planning models reveals critical bottlenecks in part-concept grounding, 3D action prediction, and long-horizon execution. PartInstruct provides reproducible baselines and clear directions for advancement.
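
To make the data layout concrete, below is a minimal sketch of how one such demonstration record could be organized. Every class and field name here is hypothetical, chosen for illustration; it does not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SkillStep:
    """One base part-based skill within a demonstration's skill chain."""
    instruction: str       # e.g. "grasp the bottle by its cap"
    part_name: str         # target part label in the object's segmentation
    actions: np.ndarray    # (T, 7) end-effector poses plus gripper state


@dataclass
class Demonstration:
    """One expert demonstration, per the dataset description above."""
    object_category: str            # one of the 14 object categories
    object_id: str                  # one of the 513 part-segmented instances
    task_instruction: str           # high-level, part-level task instruction
    skill_chain: List[SkillStep]    # ordered chain of base skill instructions
    part_labels: np.ndarray         # per-point part labels for the object
    part_poses: np.ndarray          # (num_parts, 4, 4) ground-truth part poses
```

The key structural point is that each episode carries both the high-level task instruction and its decomposition into part-grounded skills, which is what makes part-level supervision and skill chaining possible.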

📝 Abstract
Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1,302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art approaches on our benchmark, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and that they face challenges when manipulating object parts in long-horizon tasks.
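
The bi-level planning baselines mentioned above pair a high-level planner, which decomposes a task instruction into a chain of part-based skill instructions, with a low-level policy that executes each skill. A minimal rollout sketch, assuming a hypothetical planner/policy/environment interface rather than any specific baseline from the paper:

```python
def run_bilevel_episode(planner, policy, env, task_instruction,
                        max_steps_per_skill=200):
    """Roll out one episode with a bi-level model: the high-level planner
    decomposes the task into part-based skills, and the low-level policy
    executes each skill in turn. `planner`, `policy`, and `env` are
    assumed interfaces for illustration only."""
    obs = env.reset(task_instruction)
    # e.g. ["grasp the mug by its handle", "rotate the mug", ...]
    for skill in planner(task_instruction, obs):
        for _ in range(max_steps_per_skill):
            action = policy(skill, obs)        # condition on the skill instruction
            obs, skill_done = env.step(action)
            if skill_done:                     # skill-level termination signal
                break
    return env.task_success()
```

One design consequence is visible here: errors compound across the skill chain, which is consistent with the reported difficulty on long-horizon tasks.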
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale datasets for fine-grained manipulation with part-level instructions
Need for robust reasoning about object parts in manipulation tasks
Challenges in grounding part concepts and predicting 3D actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark of 513 part-annotated object instances paired with 1,302 part-level manipulation tasks
Over 10,000 expert demonstrations synthesized in a 3D simulator, each with a task instruction, skill chain, and ground-truth 3D part information
Comprehensive test suite for evaluating policy generalization across new states, objects, and tasks (sketched below)
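
A rough sketch of how such a generalization suite could be driven, assuming hypothetical split definitions and a hypothetical environment API; the benchmark's actual harness and split names may differ.

```python
# Hypothetical split definitions mirroring the three generalization axes:
# novel states of seen objects/tasks, held-out objects, and held-out tasks.
SPLITS = {
    "new_states":  {"objects": "train", "tasks": "train"},
    "new_objects": {"objects": "held-out", "tasks": "train"},
    "new_tasks":   {"objects": "train", "tasks": "held-out"},
}


def evaluate(policy, make_env, episodes_per_split=50, max_steps=500):
    """Report success rate per generalization axis.

    `make_env` samples an episode matching a split spec; the env API
    (reset/step/task_success) is assumed for illustration.
    """
    results = {}
    for split, spec in SPLITS.items():
        successes = 0
        for seed in range(episodes_per_split):
            env = make_env(spec, seed=seed)
            obs = env.reset()
            for _ in range(max_steps):
                obs, done = env.step(policy(obs))
                if done:
                    break
            successes += int(env.task_success())
        results[split] = successes / episodes_per_split
    return results
```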