🤖 AI Summary
This work addresses the challenges of spatial reference ambiguity, task decomposition difficulty, and opaque decision-making that arise when using natural language instructions for long-horizon robotic manipulation. To overcome these issues, the authors propose employing editable visual sketches as an explicit intermediate representation, establishing a closed-loop “perceive–reason–sketch–act” workflow that precisely maps linguistic intent onto scene geometry. Through a multi-stage curriculum training framework, the approach integrates modality alignment, language-to-sketch consistency constraints, and sketch-guided reinforcement imitation learning to enable adaptive policy learning and real-time action prediction. Evaluated in both simulated and real-world complex environments, the method significantly improves task success rates and dynamic robustness while supporting human-in-the-loop correction and yielding interpretable, intervenable decision processes.
📝 Abstract
Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective decomposition of long-horizon goals under closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an explicit visual intermediate that renders points, boxes, arrows, and typed relations in the robot's current views to externalize spatial intent and connect language to scene geometry. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See-Think-Sketch-Act workflow coordinated by an adaptive token-gating strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a diverse corpus with interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, in simulation and on real-world tasks, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans. Project website: https://action-sketcher.github.io
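To make the cyclic See-Think-Sketch-Act workflow concrete, here is a minimal Python sketch of the control loop. All names (`Sketch`, `gate`, `step`) and the gating rule are illustrative assumptions, not the paper's actual model or API: the real system gates via learned tokens from a VLA model, whereas this stand-in uses a simple heuristic.

```python
# Hypothetical illustration of a See-Think-Sketch-Act loop with token-gated
# control, as described in the abstract. The gating rule and data structures
# below are stand-ins, NOT the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Sketch:
    """Editable visual intermediate: points, boxes, arrows, typed relations."""
    points: list = field(default_factory=list)
    boxes: list = field(default_factory=list)
    arrows: list = field(default_factory=list)
    relations: list = field(default_factory=list)

def gate(observation, sketch):
    """Adaptive gate deciding whether to (re-)sketch or to act.
    Stand-in rule: re-sketch when no sketch exists yet or the scene changed
    (in the real system this would be a learned token-gated decision)."""
    if not sketch.points or observation.get("scene_changed"):
        return "SKETCH"
    return "ACT"

def step(observation, sketch):
    """One loop iteration: See (observation) -> Think/Sketch or Act."""
    if gate(observation, sketch) == "SKETCH":
        # Ground the linguistic goal onto current scene geometry (stand-in:
        # drop a single target point into the sketch).
        sketch.points = [observation["target_xy"]]
        return sketch, None
    # Sketch-conditioned action prediction (stand-in: move to the point).
    return sketch, {"move_to": sketch.points[0]}

# Usage: two iterations -- the first re-sketches, the second issues an action.
sketch = Sketch()
obs = {"target_xy": (0.4, 0.2), "scene_changed": False}
sketch, action = step(obs, sketch)   # gate chooses SKETCH, action is None
sketch, action = step(obs, sketch)   # gate chooses ACT
```

Because the sketch is an explicit, editable object, a human could overwrite `sketch.points` between iterations, which is the hook for the human-in-the-loop correction the abstract describes.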