Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling robots to autonomously learn and execute long-horizon furniture assembly tasks from abstract, multimodal pictorial assembly manuals. We propose a vision-language model (VLM)-based approach for structured manual parsing and hierarchical assembly graph modeling. Our method establishes, for the first time, an end-to-end mapping from multimodal manual instructions to executable 6D pose operations: it leverages VLMs for joint visual-linguistic understanding, integrates 6D pose estimation to infer spatial relationships among parts, and constructs a hierarchical assembly graph to support task decomposition and motion planning. Evaluated on real-world IKEA furniture assembly, the system achieves fully autonomous completion across multiple products, demonstrating high precision (mean pose error < 3 mm / 2°), strong generalization (zero-shot cross-model transfer), and robustness in long-sequence tasks. This work introduces a novel paradigm for embodied agents to interpret and operationalize structured procedural knowledge.
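The hierarchical assembly graph at the core of this pipeline can be pictured as a tree whose leaves are atomic parts and whose internal nodes are subassemblies; traversing it bottom-up yields the step order a manual implies. A minimal sketch of that idea (all class and part names here are illustrative, not from the paper):

```python
# Sketch of a hierarchical assembly graph: leaves are atomic parts,
# internal nodes are subassemblies built from their children.
from dataclasses import dataclass, field

@dataclass
class Node:
    """A part (leaf) or subassembly (internal node) in the assembly graph."""
    name: str
    children: list = field(default_factory=list)  # empty for atomic parts

    def is_part(self) -> bool:
        return not self.children

def assembly_steps(root: Node) -> list:
    """Post-order traversal: children must exist before their parent is
    assembled, mirroring the bottom-up order a pictorial manual implies."""
    steps = []
    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)
        if not node.is_part():
            steps.append((node.name, [c.name for c in node.children]))
    visit(root)
    return steps

# Toy manual: legs attach to the frame first, then the seat goes on top.
frame = Node("frame_with_legs", [Node("frame"), Node("leg_L"), Node("leg_R")])
chair = Node("chair", [frame, Node("seat")])
print(assembly_steps(chair))
# -> [('frame_with_legs', ['frame', 'leg_L', 'leg_R']),
#     ('chair', ['frame_with_legs', 'seat'])]
```

Each emitted step names a subassembly and the components it consumes, which is the unit a downstream pose-estimation and motion-planning stage would operate on.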

📝 Abstract
Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to interpret abstract, pictorial instruction manuals
Translating high-level instructions into executable robot actions
Performing long-horizon furniture assembly tasks with precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model for structured manual parsing
Hierarchical assembly graph construction
Relative 6D pose estimation for components at each assembly step
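The pose-estimation stage predicts *relative* 6D poses per assembly step; placing a part in the world then amounts to chaining homogeneous transforms. A minimal sketch of that composition, using plain 4x4 matrices and illustrative values (the paper's actual model and frames are not reproduced here):

```python
# Sketch: composing a per-step relative 6D pose (as a 4x4 homogeneous
# transform) onto a base pose to place a part in the world frame.
import math

def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose(yaw, tx, ty, tz):
    """Homogeneous transform: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s,  c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# World pose of a leg = pose of the frame composed with the relative pose
# a model would predict for this assembly step (values are made up).
T_frame = pose(0.0, 0.0, 0.0, 0.4)             # frame 0.4 m above the floor
T_leg_rel = pose(math.pi / 2, 0.1, 0.1, -0.4)  # leg offset in the frame's frame
T_leg_world = matmul4(T_frame, T_leg_rel)
translation = [round(T_leg_world[i][3], 3) for i in range(3)]
print(translation)
# -> [0.1, 0.1, 0.0]
```

The same composition rule extends up the assembly graph: each subassembly's world pose is the parent's pose times the child's predicted relative pose.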