🤖 AI Summary
Existing research leverages appliance manuals solely for question answering, overlooking their critical role in guiding multi-step, multi-page operational procedures. To address this gap, we propose a manual-driven operational planning paradigm and introduce CheckManual—the first benchmark for instruction manual understanding and autonomous appliance operation—featuring CAD-synthesized multimodal manuals, a PyBullet-based interactive simulation environment, and comprehensive, multi-dimensional evaluation metrics. We design a manual-action joint embedding scheme and a stepwise planning architecture, yielding the end-to-end model ManualPlan. Furthermore, we establish a large-language-model-assisted, human-validated pipeline for synthetic manual generation. Systematic evaluation on CheckManual demonstrates that ManualPlan significantly outperforms state-of-the-art multimodal foundation models and embodied agents, achieving the first quantitative breakthroughs in task completion rate, step accuracy, and manual adherence.
📝 Abstract
The correct use of electrical appliances has significantly improved the quality of human life. Unlike simple tools that can be manipulated with common sense, the different parts of an electrical appliance have specific functions defined by its manufacturer. If we want a robot to heat bread in a microwave, we should enable it to review the microwave's manual first. From the manual, it can learn about the appliance's component functions, interaction methods, and representative task steps. However, previous manual-related work remains limited to question-answering tasks, while existing manipulation research ignores the manual's important role and fails to comprehend multi-page manuals. In this paper, we propose CheckManual, the first manual-based appliance manipulation benchmark. Specifically, we design a large-model-assisted, human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for evaluating model performance. Furthermore, we propose ManualPlan, the first manual-based manipulation planning model, to set up a group of baselines for the CheckManual benchmark.