🤖 AI Summary
Existing robotic assembly methods often neglect connector modeling, resulting in insufficient robustness. This paper proposes a unified assembly framework centered on connection relationships: for the first time, connectors are treated as first-class entities, with explicit modeling of their types, specifications, quantities, and spatial configurations. We construct the first large-scale assembly dataset supporting multiple connection types; integrate vision-language models to parse illustrated assembly manuals, generating hierarchical graph representations that encode parts, subassemblies, and explicit connections; and employ a hierarchical graph neural network to achieve end-to-end mapping from natural-language instructions to executable robotic skills. Evaluated across four complex assembly scenarios spanning furniture, toys, and industrial components, the framework accurately identifies diverse connection types and executes them in a physically grounded manner in simulation.
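To make the connection-centric representation concrete, below is a minimal Python sketch of a hierarchical assembly graph of the kind the summary describes: nodes for parts and subassemblies, and edges that carry connector type, specification, quantity, and placement. All class and field names (`Connector`, `ConnectionEdge`, `AssemblyNode`, etc.) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Connector:
    """A connector modeled as a first-class entity (field names are assumptions)."""
    ctype: str      # e.g., "cam_lock", "wood_dowel", "M4_screw"
    spec: str       # size/standard, e.g., "15mm"
    quantity: int   # number of identical connectors this connection uses

@dataclass
class ConnectionEdge:
    """Edge between two nodes, explicitly modeling the physical connection."""
    src: str                                       # part or subassembly id
    dst: str
    connector: Connector
    placements: list[tuple[float, float, float]]   # insertion points in src frame

@dataclass
class AssemblyNode:
    """A part (leaf) or subassembly (internal node) in the hierarchy."""
    node_id: str
    children: list["AssemblyNode"] = field(default_factory=list)

@dataclass
class AssemblyGraph:
    root: AssemblyNode
    edges: list[ConnectionEdge] = field(default_factory=list)

# Example: one manual step attaching a side panel to a tabletop with 4 cam locks.
graph = AssemblyGraph(
    root=AssemblyNode("table", children=[
        AssemblyNode("tabletop"),
        AssemblyNode("side_panel_L"),
    ]),
    edges=[ConnectionEdge(
        src="side_panel_L", dst="tabletop",
        connector=Connector(ctype="cam_lock", spec="15mm", quantity=4),
        placements=[(0.05, 0.0, 0.02), (0.05, 0.0, 0.30),
                    (0.05, 0.0, 0.58), (0.05, 0.0, 0.86)],
    )],
)
```

Representing connections as typed, quantified edges (rather than implicit pose constraints) is what lets a downstream policy select the right connector skill and the right number of insertions per step.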
📝 Abstract
Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the critical "last mile" of assembly execution: while task planning may sequence operations and motion planning may position parts, the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we treat connections as first-class primitives in assembly representation, including connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs in which nodes represent parts and sub-assemblies and edges explicitly model connection relationships between components. A large vision-language model parses the symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset of over 20 assembly tasks with diverse connector types to validate our representation-extraction approach, and we evaluate the complete task-understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world counterparts.
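To illustrate how a parsed manual step might instantiate such a graph, the self-contained sketch below assumes the vision-language model emits one structured JSON record per step, which is then converted into a connection-edge record. The paper does not publish its actual output schema, so every field name here is a hypothetical stand-in.

```python
import json

# Hypothetical structured output for one manual step; the real schema is
# unspecified, so "attach", "connector", and "placements" are assumptions.
vlm_step = json.loads("""
{
  "step": 3,
  "attach": {"part": "side_panel_L", "to": "tabletop"},
  "connector": {"type": "cam_lock", "spec": "15mm", "quantity": 4},
  "placements": [[0.05, 0.0, 0.02], [0.05, 0.0, 0.30]]
}
""")

def step_to_edge(step: dict) -> dict:
    """Convert one parsed manual step into a connection-edge record
    (keys mirror the hierarchical-graph sketch in the summary above)."""
    return {
        "src": step["attach"]["part"],
        "dst": step["attach"]["to"],
        "connector": step["connector"],               # type / spec / quantity
        "placements": [tuple(p) for p in step["placements"]],
    }

print(step_to_edge(vlm_step))
```

Because each step names its connector explicitly, the resulting edges can be checked against the parts list (e.g., total cam-lock count) before execution, which is one plausible way such a representation supports physically grounded assembly.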