🤖 AI Summary
Natural language, the dominant medium for human-robot interaction, suffers from ambiguity and low spatial precision when used for robotic control. This paper introduces Robotic Visual Instruction (RoVI), a paradigm that replaces textual commands with object-centric, hand-drawn symbols (arrows, circles, colors, and numerals) to enable high-precision spatial guidance. Its core contributions are threefold: (1) RoVI itself, presented as the first vision-based instruction representation designed explicitly for robotics; (2) a specialized dataset of 15K instances for fine-tuning small, edge-deployable vision-language models (VLMs) to interpret RoVI; and (3) the Visual Instruction Embodied Workflow (VIEW), an end-to-end pipeline that combines VLM-based instruction parsing, 2D keypoint extraction, and 3D action-sequence generation. Evaluated on 11 novel tasks in both real and simulated environments, VIEW achieves an 87.5% success rate on unseen real-world tasks featuring multi-step actions, disturbances, and trajectory-following requirements, demonstrating strong generalization and robustness.
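To make the symbol vocabulary concrete, here is a minimal sketch of how an object-centric RoVI instruction could be represented in code. The types, field names, and two-symbol vocabulary are illustrative assumptions for this summary, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Symbol(Enum):
    ARROW = "arrow"    # directed motion: stroke runs from start to goal
    CIRCLE = "circle"  # selects an object or region of interest

@dataclass
class RoVIElement:
    """One hand-drawn stroke on the scene image (hypothetical schema)."""
    symbol: Symbol
    color: str                        # e.g. "red"; disambiguates co-occurring strokes
    step: int                         # numeral drawn beside the stroke: execution order
    points_px: list[tuple[int, int]]  # 2D pixel coordinates tracing the stroke

@dataclass
class RoVIInstruction:
    """A full instruction: an ordered set of symbolic strokes."""
    elements: list[RoVIElement] = field(default_factory=list)

    def ordered(self) -> list[RoVIElement]:
        # Numerals encode the temporal order needed for multi-step tasks.
        return sorted(self.elements, key=lambda e: e.step)
```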
📝 Abstract
Natural language has recently been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm that guides robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to better understand RoVI and generate precise actions from it, we present the Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decodes spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transforms them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios on unseen tasks that feature multi-step actions, disturbances, and trajectory-following requirements. The code and datasets from this paper will be released soon.
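As a rough illustration of the three-stage flow the abstract describes, the sketch below wires VLM parsing, 2D keypoint extraction, and 3D lifting together, using standard pinhole back-projection with a depth map and camera intrinsics for the 2D-to-3D step. The function names and the callable standing in for the fine-tuned VLM are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def lift_to_3d(u: int, v: int, depth_m: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with metric depth into a camera-frame 3D point."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth_m / fx,
                     (v - cy) * depth_m / fy,
                     depth_m])

def view_pipeline(image, depth, K, parse_keypoints):
    """VIEW-style flow: sketched image -> ordered 2D keypoints -> 3D waypoints.

    `parse_keypoints` stands in for the fine-tuned VLM: it maps the image with
    its hand-drawn RoVI symbols to (step, u, v) tuples. Its interface here is
    an assumption.
    """
    # 1) Interpret the drawn symbols; sort by the numeral-encoded step order.
    keypoints = sorted(parse_keypoints(image))
    # 2) Lift each 2D keypoint into 3D using the aligned depth map (row v, col u).
    waypoints = [lift_to_3d(u, v, depth[v, u], K) for _, u, v in keypoints]
    # 3) Hand the ordered 3D waypoints to a low-level controller or planner.
    return waypoints
```

Passing the VLM in as a callable keeps the sketch agnostic to which fine-tuned model actually performs the parsing; only the keypoint-lifting geometry is standard.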