🤖 AI Summary
Video diffusion models for robotic manipulation suffer from limited controllability and poor spatiotemporal consistency; existing approaches typically rely on 2D or unimodal trajectory conditioning, hindering the generation of high-fidelity, executable demonstration videos. This work introduces DRAW2ACT, which extracts depth-aware, mutually orthogonal trajectory representations (capturing depth, semantics, shape, and motion) and injects them into a conditional RGB-depth co-generation diffusion framework. A cross-modality attention mechanism and a depth-aware supervision strategy enhance spatiotemporal consistency, and a multimodal policy model performs, for the first time, end-to-end regression from the generated videos to robot joint angles. Evaluated on Bridge V2, Berkeley Autolab, and simulation benchmarks, DRAW2ACT significantly improves visual fidelity and spatiotemporal coherence while achieving higher robotic manipulation success rates than state-of-the-art methods.
📝 Abstract
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single-modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape, and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance spatiotemporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
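To make the cross-modality coupling concrete, here is a minimal NumPy sketch of bidirectional cross-attention between RGB and depth token streams, in the spirit of the mechanism the abstract describes: each modality queries the other and fuses the result residually so the two generated videos stay aligned. All shapes, names, and the residual-fusion choice are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_attention(queries, context):
    # Scaled dot-product attention: queries come from one modality,
    # keys/values from the other (here keys == values == context).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 16))    # 8 RGB tokens, 16-dim features
depth = rng.standard_normal((8, 16))  # 8 depth tokens, 16-dim features

# Bidirectional exchange: RGB attends to depth and vice versa,
# each fused back into its own stream via a residual connection.
rgb_fused = rgb + cross_attention(rgb, depth)
depth_fused = depth + cross_attention(depth, rgb)
print(rgb_fused.shape, depth_fused.shape)  # (8, 16) (8, 16)
```

In the full model this exchange would operate on spatio-temporal video tokens inside the diffusion backbone (with learned projections and multiple heads); the sketch only shows the attention pattern that ties the two modalities together.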