Robotic Visual Instruction

📅 2025-05-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Natural language interaction for robotic spatial control suffers from ambiguity and low precision. This paper introduces Robotic Visual Instruction (RoVI), a paradigm that replaces textual commands with object-centric, hand-drawn symbols, such as arrows, circles, colors, and numerals, to enable high-precision spatial guidance. Its core contributions are threefold: (1) the first vision-based instruction representation explicitly designed for robotics; (2) a specialized fine-tuning dataset of 15K annotated vision-instruction-action instances; and (3) the Visual Instruction Embodied Workflow (VIEW), an end-to-end vision-to-action framework that integrates a vision-language model (VLM), 2D keypoint parsing, and 3D action-sequence generation, with a lightweight fine-tuning strategy that makes small VLMs edge-deployable. Evaluated across 11 novel tasks, VIEW achieves an 87.5% success rate on unseen real-world tasks featuring multi-step actions, disturbances, and trajectory-following requirements, demonstrating substantial gains in robustness and generalization.

📝 Abstract
Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Address ambiguity in robotic control via visual instructions
Encode spatial-temporal data into interpretable 2D sketches
Translate visual inputs to executable 3D action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric hand-drawn symbolic representation for robotic tasks
Vision-Language Models decode 2D sketches into 3D actions
Specialized dataset fine-tunes VLMs for edge deployment
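The paper's code has not yet been released, but the lifting step it describes, turning 2D keypoints parsed from the sketch into 3D targets, is standard pinhole back-projection with a depth map. The sketch below is illustrative only: the function name and the assumption that VIEW consumes (u, v) pixel keypoints plus an aligned depth image are hypothetical, not taken from the paper.

```python
import numpy as np

def backproject_keypoints(keypoints_2d, depth, K):
    """Lift 2D pixel keypoints to 3D camera-frame points (pinhole model).

    keypoints_2d: (N, 2) array of (u, v) pixel coordinates, e.g. the
                  output of a VLM keypoint-parsing stage (hypothetical).
    depth:        (H, W) depth map in meters, aligned with the RGB image.
    K:            (3, 3) camera intrinsic matrix.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    points_3d = []
    for u, v in keypoints_2d:
        z = depth[int(v), int(u)]      # depth lookup at the keypoint pixel
        x = (u - cx) * z / fx          # back-project through the pinhole model
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.array(points_3d)

# Toy example: constant 2 m depth, principal point at pixel (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
pts = backproject_keypoints(np.array([[2, 2]]), depth, K)
# A keypoint at the principal point maps to (0, 0, 2) in the camera frame.
```

A real pipeline would then transform these camera-frame points into the robot base frame via the hand-eye calibration and order them (e.g. by the numerals drawn in the instruction) into an action sequence.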
Yanbang Li
Imperial College London
Ziyang Gong
SJTU, THU, Shanghai AI Lab (OpenGVLab), SYSU
Embodied Spatial Intelligence
Haoyang Li
UC San Diego
Xiaoqi Huang
VIVO
Haolan Kang
South China University of Technology
Guangping Bai
Independent Researcher
Xianzheng Ma
VGG & AVL, University of Oxford
3D Computer Vision · Embodied AI · Large Language Models