Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of spatial reference ambiguity, task decomposition difficulty, and opaque decision-making that arise when using natural language instructions for long-horizon robotic manipulation. To overcome these issues, the authors propose employing editable visual sketches as an explicit intermediate representation, establishing a closed-loop “perceive–reason–sketch–act” workflow that precisely maps linguistic intent onto scene geometry. Through a multi-stage curriculum training framework, the approach integrates modality alignment, language-to-sketch consistency constraints, and sketch-guided reinforcement imitation learning to enable adaptive policy learning and real-time action prediction. Evaluated in both simulated and real-world complex environments, the method significantly improves task success rates and dynamic robustness while supporting human-in-the-loop correction and yielding interpretable, intervenable decision processes.
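
The summary above describes a closed-loop "perceive–reason–sketch–act" workflow in which the sketch is revised only when a reasoning trigger fires. As a rough, non-authoritative illustration of that control flow, the following Python sketch stubs out each stage; every class and function name here (Percept, Sketch, perceive, reason_to_sketch, should_resketch, act) is a hypothetical placeholder rather than the authors' API, and the adaptive token gate is reduced to a fixed schedule.

```python
# Illustrative only: a minimal closed-loop "perceive-reason-sketch-act" skeleton.
# All names and logic here are hypothetical stand-ins; the paper's actual system
# is a single trained VLA model with token-gated reasoning, not separate functions.
from dataclasses import dataclass, field


@dataclass
class Percept:
    """Current observation: an image plus detected object boxes."""
    image: object                                   # e.g., an HxWx3 camera frame
    boxes: dict = field(default_factory=dict)       # object name -> (x1, y1, x2, y2)


@dataclass
class Sketch:
    """Editable visual sketch overlaid on the current view."""
    points: list = field(default_factory=list)      # (x, y) keypoints
    boxes: list = field(default_factory=list)       # (x1, y1, x2, y2) target regions
    arrows: list = field(default_factory=list)      # ((x0, y0), (x1, y1)) motion hints
    relations: list = field(default_factory=list)   # typed relations, e.g. ("place-on", "cup", "plate")


def perceive(step: int) -> Percept:
    """Stub perception; a real system would detect objects in the camera frame."""
    return Percept(image=None, boxes={"cup": (10, 10, 40, 40), "plate": (60, 60, 120, 120)})


def reason_to_sketch(instruction: str, percept: Percept) -> Sketch:
    """Stub reasoning: turn language plus scene geometry into an explicit sketch of intent."""
    cup, plate = percept.boxes["cup"], percept.boxes["plate"]
    cup_center = ((cup[0] + cup[2]) // 2, (cup[1] + cup[3]) // 2)
    plate_center = ((plate[0] + plate[2]) // 2, (plate[1] + plate[3]) // 2)
    return Sketch(
        points=[cup_center],
        boxes=[plate],
        arrows=[(cup_center, plate_center)],
        relations=[("place-on", "cup", "plate")],
    )


def should_resketch(step: int) -> bool:
    """Stand-in for the adaptive token gate: re-plan only when triggered."""
    return step % 5 == 0  # e.g., periodically or when the scene changes


def act(sketch: Sketch, percept: Percept) -> None:
    """Stub low-level policy: would predict an action conditioned on the sketch."""
    print(f"acting toward {sketch.relations[0]}")


def run(instruction: str, horizon: int = 10) -> None:
    sketch = None
    for step in range(horizon):
        percept = perceive(step)
        if sketch is None or should_resketch(step):
            sketch = reason_to_sketch(instruction, percept)  # sketch revision
        act(sketch, percept)                                 # real-time action issuance


if __name__ == "__main__":
    run("put the cup on the plate")
```

In the actual framework the gating, reasoning, and action stages are realized inside one VLA model rather than as hand-written functions; the skeleton only shows how sketch revision can be decoupled from per-step action prediction.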

📝 Abstract
Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective task decomposition of long-horizon goals with closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an explicit visual intermediate that renders points, boxes, arrows, and typed relations in the robot's current views to externalize spatial intent and connect language to scene geometry. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See-Think-Sketch-Act workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a diverse corpus of interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, in both simulation and the real world, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans. Project website: https://action-sketcher.github.io
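
The abstract describes a training corpus that interleaves images, text, Visual Sketch supervision, and action sequences. Purely as an illustration of what one such record could look like, the JSON-style serialization below is an assumption: the field names, coordinates, plan strings, and action encoding are hypothetical, not the paper's released schema.

```python
# Illustrative only: one possible serialization of an interleaved training record
# (image reference, instruction, Visual Sketch, plan, actions). All fields are assumed.
import json

record = {
    "images": ["frame_000.png"],                             # robot's current view(s)
    "instruction": "stack the red block on the blue block",
    "visual_sketch": {
        "points": [[112, 86]],                               # grasp keypoint on the red block
        "boxes": [[200, 140, 260, 200]],                      # target region on the blue block
        "arrows": [[[112, 86], [230, 170]]],                  # intended motion in image space
        "relations": [["on-top-of", "red_block", "blue_block"]],
    },
    "plan": ["grasp red_block", "move above blue_block", "place"],
    "actions": [[0.12, -0.04, 0.30, 0.0, 0.0, 0.0, 1.0]],     # e.g., end-effector delta + gripper
}

print(json.dumps(record, indent=2))
```

Keeping the sketch as structured primitives (points, boxes, arrows, typed relations) rather than free text is what allows it to be rendered onto the current view, revised in the loop, and edited by a human.
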
Problem

Research questions and friction points this paper is trying to address.

long-horizon robotic manipulation
referential grounding
task decomposition
causal explanation
Vision-Language-Action
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Sketch
Vision-Language-Action (VLA)
Long-horizon Manipulation
Interpretable Robotics
Token-gated Reasoning
Huajie Tan
Peking University
Embodied AI, Foundation Models
P. Co
Beijing Academy of Artificial Intelligence
Yijie Xu
Hong Kong University of Science and Technology (Guangzhou)
Data Mining, Natural Language Processing, Large Language Models
Shanyu Rong
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yuheng Ji
Institute of Automation, Chinese Academy of Sciences
Embodied AI, Computer Vision
Cheng Chi
Columbia University, Stanford University
Robotics
Xiansheng Chen
Beijing Academy of Artificial Intelligence
Qiongyu Zhang
University of Sydney
Zhongxia Zhao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Pengwei Wang
University of Calgary
Computer Science, Security
Zhongyuan Wang
BAAI
Knowledge Mining, Database, NLP, Text Understanding
Shanghang Zhang
Peking University
Embodied AI, Foundation Models