Robotic Visual Instruction

📅 2025-05-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Natural language interaction for robotic spatial control suffers from ambiguity and low precision. This paper introduces Robotic Visual Instruction (RoVI), a paradigm that replaces textual commands with object-centric, hand-drawn symbols, such as arrows, circles, colors, and numerals, to enable high-precision spatial guidance. Its core contributions are threefold: (1) the first vision-based instruction representation explicitly designed for robotics; (2) a specialized fine-tuning dataset of 15K annotated vision-instruction-action instances; and (3) the Visual Instruction Embodied Workflow (VIEW), an end-to-end vision-to-action framework that integrates a vision-language model (VLM), 2D keypoint parsing, and 3D action-sequence generation, with a lightweight fine-tuning strategy that makes small VLMs edge-deployable. Evaluated across 11 novel tasks, VIEW achieves an 87.5% success rate on unseen real-world tasks featuring multi-step actions, disturbances, and trajectory-following requirements, demonstrating substantial gains in robustness and generalization.

📝 Abstract
Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Address ambiguity in robotic control via visual instructions
Encode spatial-temporal data into interpretable 2D sketches
Translate visual inputs to executable 3D action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric hand-drawn symbolic representation for robotic tasks
Vision-Language Models decode 2D sketches into 3D actions
Specialized dataset fine-tunes VLMs for edge deployment
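The paper's code has not yet been released, but the lifting step it describes, turning 2D keypoints parsed from the sketch into 3D targets, is standard pinhole back-projection with a depth map. The sketch below is illustrative only: the function name and the assumption that VIEW consumes (u, v) pixel keypoints plus an aligned depth image are hypothetical, not taken from the paper.

```python
import numpy as np

def backproject_keypoints(keypoints_2d, depth, K):
    """Lift 2D pixel keypoints to 3D camera-frame points (pinhole model).

    keypoints_2d: (N, 2) array of (u, v) pixel coordinates, e.g. the
                  output of a VLM keypoint-parsing stage (hypothetical).
    depth:        (H, W) depth map in meters, aligned with the RGB image.
    K:            (3, 3) camera intrinsic matrix.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    points_3d = []
    for u, v in keypoints_2d:
        z = depth[int(v), int(u)]      # depth lookup at the keypoint pixel
        x = (u - cx) * z / fx          # back-project through the pinhole model
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.array(points_3d)

# Toy example: constant 2 m depth, principal point at pixel (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
pts = backproject_keypoints(np.array([[2, 2]]), depth, K)
# A keypoint at the principal point maps to (0, 0, 2) in the camera frame.
```

A real pipeline would then transform these camera-frame points into the robot base frame via the hand-eye calibration and order them (e.g. by the numerals drawn in the instruction) into an action sequence.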
Yanbang Li
Imperial College London
Ziyang Gong
SJTU, THU, Shanghai AI Lab (OpenGVLab), SYSU
Embodied Spatial Intelligence
Haoyang Li
UC San Diego
Xiaoqi Huang
VIVO
Haolan Kang
South China University of Technology
Guangping Bai
Independent Researcher
Xianzheng Ma
VGG & AVL, University of Oxford
3D Computer Vision · Embodied AI · Large Language Models