L2D2: Robot Learning from 2D Drawings

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Conventional imitation learning requires repeated physical human demonstrations, while 2D sketching struggles to encode complex 3D dynamic tasks. Method: A sketch-driven, lightweight imitation learning paradigm that acquires novel tasks from a single hand-drawn 2D trajectory on a scene image. The approach uses vision-language segmentation (Grounding DINO + SAM) to generate diverse synthetic scene images, models 2D–3D geometric constraints, and distills a small set of real-world demonstrations via few-shot learning, enabling cross-modal alignment from 2D sketches to 3D robot actions. Contribution/Results: The first end-to-end cross-dimensional grounding framework achieving "2D sketch → 3D execution" without physical environment resets. It exhibits strong generalization, supports long-horizon tasks (≥8 steps), and in user studies improves sketching efficiency by 3.2×, reduces demonstration data by 67%, and outperforms state-of-the-art methods by 19.4% in policy performance.

📝 Abstract
Robots should learn new tasks from humans. But how do humans convey what they want the robot to do? Existing methods largely rely on humans physically guiding the robot arm throughout their intended task. Unfortunately -- as we scale up the amount of data -- physical guidance becomes prohibitively burdensome. Not only do humans need to operate robot hardware but also modify the environment (e.g., moving and resetting objects) to provide multiple task examples. In this work we propose L2D2, a sketching interface and imitation learning algorithm where humans can provide demonstrations by drawing the task. L2D2 starts with a single image of the robot arm and its workspace. Using a tablet, users draw and label trajectories on this image to illustrate how the robot should act. To collect new and diverse demonstrations, we no longer need the human to physically reset the workspace; instead, L2D2 leverages vision-language segmentation to autonomously vary object locations and generate synthetic images for the human to draw upon. We recognize that drawing trajectories is not as information-rich as physically demonstrating the task. Drawings are 2-dimensional and do not capture how the robot's actions affect its environment. To address these fundamental challenges the next stage of L2D2 grounds the human's static, 2D drawings in our dynamic, 3D world by leveraging a small set of physical demonstrations. Our experiments and user study suggest that L2D2 enables humans to provide more demonstrations with less time and effort than traditional approaches, and users prefer drawings over physical manipulation. When compared to other drawing-based approaches, we find that L2D2 learns more performant robot policies, requires a smaller dataset, and can generalize to longer-horizon tasks. See our project website: https://collab.me.vt.edu/L2D2/
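The abstract's central technical challenge is grounding a static 2D drawing in the dynamic 3D workspace. As a minimal illustration of the geometry involved (a generic pinhole back-projection sketch, not the paper's actual learned grounding, which additionally leverages a few physical demonstrations), the snippet below lifts drawn pixel waypoints into 3D camera coordinates; the intrinsic matrix `K` and per-waypoint depths are assumed inputs, e.g. from a calibrated depth camera.

```python
import numpy as np

def lift_trajectory_to_3d(pixels, depths, K):
    """Back-project drawn 2D pixel waypoints into 3D camera coordinates.

    pixels: (N, 2) array of (u, v) pixel coordinates traced on the scene image.
    depths: (N,) array of per-waypoint depths along the optical axis (assumed given).
    K:      (3, 3) camera intrinsic matrix.
    Returns an (N, 3) array of points in the camera frame.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = pixels[:, 0], pixels[:, 1]
    x = (u - cx) / fx * depths
    y = (v - cy) / fy * depths
    return np.stack([x, y, depths], axis=1)

# Toy example: three waypoints drawn on a 640x480 image at a constant 0.5 m depth.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
pixels = np.array([[320.0, 240.0], [350.0, 240.0], [380.0, 260.0]])
points_3d = lift_trajectory_to_3d(pixels, np.full(3, 0.5), K)
```

A waypoint at the principal point maps to the optical axis, and offsets scale with depth; the paper's real difficulty, as the abstract notes, is that drawings do not capture depth or object dynamics, which is why L2D2 grounds them with a small set of physical demonstrations.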
Problem

Research questions and friction points this paper addresses.

- Enabling robots to learn tasks from 2D human drawings
- Reducing physical human effort in robot demonstration collection
- Grounding static 2D sketches in 3D dynamics for effective imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Sketching interface for providing robot task demonstrations by drawing
- Vision-language segmentation for generating synthetic scene images
- Combining 2D drawings with a small set of physical demonstrations
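As a rough illustration of the synthetic-image idea (a crude cut-and-paste sketch, not the paper's actual pipeline), the snippet below repositions a segmented object within a scene image to produce one variant for the human to draw on. The boolean `mask` is assumed to come from a text-prompted segmenter such as Grounding DINO + SAM, and the mean-color hole fill is a stand-in for real inpainting.

```python
import numpy as np

def reposition_object(image, mask, dx, dy):
    """Cut the masked object out of the scene image and paste it at a pixel
    offset, yielding one synthetic scene variant.

    image: (H, W, 3) uint8 scene image.
    mask:  (H, W) boolean object mask (assumed from a vision-language segmenter).
    dx, dy: horizontal/vertical pixel offsets for the new object location.
    """
    out = image.copy()
    ys, xs = np.nonzero(mask)
    # Crude "inpainting": fill the vacated region with the mean background color.
    out[ys, xs] = image[~mask].mean(axis=0)
    # Paste the object pixels at the offset location, clipped to the frame.
    ny, nx = ys + dy, xs + dx
    valid = (ny >= 0) & (ny < image.shape[0]) & (nx >= 0) & (nx < image.shape[1])
    out[ny[valid], nx[valid]] = image[ys[valid], xs[valid]]
    return out

# Toy example: move a white 2x2 "object" three pixels down and right.
img = np.zeros((10, 10, 3), dtype=np.uint8)
img[2:4, 2:4] = 255
obj_mask = np.zeros((10, 10), dtype=bool)
obj_mask[2:4, 2:4] = True
synthetic = reposition_object(img, obj_mask, 3, 3)
```

Drawing on such variants lets users supply demonstrations for new object configurations without anyone physically resetting the workspace.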