V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of acquiring large-scale, diverse manipulation data for general-purpose robot training by proposing the first end-to-end automated framework for simulation data generation. From natural language instructions alone, it synthesizes open-vocabulary 3D scenes and executable expert trajectories. The key innovation lies in employing a video generation model as a motion prior, combined with geometric constraint verification, to enable high-fidelity, high-diversity behavior synthesis without human intervention. The system integrates a large language model, a 3D generative model, a video generation model, and a Sim-to-Gen visual-kinematic alignment module (built on CoTracker3 and VGGT). Policies trained on the synthesized data for tabletop tasks with the Piper robotic arm not only generalize to unseen objects in simulation but also transfer sim-to-real, successfully manipulating a variety of novel real-world objects.
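The geometric constraint verification step can be pictured with a small sketch. The Python below is illustrative only: it assumes a hypothetical axis-aligned bounding-box representation of the generated objects and a simple rests-on-the-table stability proxy; the paper does not specify its checks at this level of detail.

```python
import numpy as np

def aabbs_overlap(min_a, max_a, min_b, max_b, margin=0.0):
    """True if two axis-aligned bounding boxes intersect (with optional margin)."""
    return bool(np.all(min_a - margin < max_b) and np.all(min_b - margin < max_a))

def layout_is_valid(objects, table_height=0.0, tol=1e-3):
    """Check a candidate layout for collisions and unsupported (floating) objects.

    `objects` is a list of dicts with hypothetical keys 'name', 'aabb_min',
    'aabb_max' (world-frame box corners); this schema is an assumption, not
    the paper's actual scene representation.
    """
    for i, a in enumerate(objects):
        # Stability proxy: each object must rest on the support surface
        # (a fuller check would also handle stacking and center-of-mass support).
        if a["aabb_min"][2] > table_height + tol:
            return False, f"{a['name']} is floating above the support surface"
        for b in objects[i + 1:]:
            if aabbs_overlap(a["aabb_min"], a["aabb_max"],
                             b["aabb_min"], b["aabb_max"]):
                return False, f"{a['name']} intersects {b['name']}"
    return True, "layout is stable and collision-free"

# Example: a mug and a plate on a tabletop at z = 0.
scene = [
    {"name": "mug",   "aabb_min": np.array([0.10, 0.10, 0.0]),
                      "aabb_max": np.array([0.18, 0.18, 0.09])},
    {"name": "plate", "aabb_min": np.array([0.30, 0.05, 0.0]),
                      "aabb_max": np.array([0.50, 0.25, 0.02])},
]
print(layout_is_valid(scene))
```

In the actual framework, layouts failing such checks would be rejected or resampled before any trajectory synthesis takes place.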

📝 Abstract
Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.
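To make the Sim-to-Gen visual-kinematic alignment concrete, here is a minimal, hypothetical sketch of one plausible mapping: 2D keypoint tracks (as a point tracker like CoTracker3 produces) are lifted into 3D using dense per-frame point maps (as a feed-forward geometry model like VGGT regresses), then smoothed into end-effector waypoints for a downstream inverse-kinematics solver. All array shapes and function names here are assumptions, not the paper's actual interfaces.

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d, point_maps):
    """Lift per-frame 2D keypoint tracks into 3D via dense per-pixel point maps.

    tracks_2d:  (T, K, 2) pixel coordinates, e.g. from a point tracker
                such as CoTracker3 (shape is an assumption here).
    point_maps: (T, H, W, 3) per-frame 3D points in a shared frame, e.g. as
                predicted by a geometry model such as VGGT.
    Returns (T, K, 3) 3D trajectories for the tracked keypoints.
    """
    T, K, _ = tracks_2d.shape
    _, H, W, _ = point_maps.shape
    out = np.empty((T, K, 3))
    for t in range(T):
        # Nearest-pixel lookup; a real system would interpolate and filter.
        uv = np.round(tracks_2d[t]).astype(int)
        u = np.clip(uv[:, 0], 0, W - 1)
        v = np.clip(uv[:, 1], 0, H - 1)
        out[t] = point_maps[t, v, u]
    return out

def smooth_trajectory(traj_3d, window=5):
    """Moving-average smoothing before handing waypoints to an IK solver."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(traj_3d[:, d], kernel, mode="same")
                     for d in range(traj_3d.shape[1])], axis=1)

# Example with synthetic data: one tracked gripper keypoint over 30 frames.
T, H, W = 30, 64, 64
rng = np.random.default_rng(0)
tracks = rng.uniform([0, 0], [W - 1, H - 1], size=(T, 1, 2))
pmaps = rng.normal(size=(T, H, W, 3))
ee_path = smooth_trajectory(lift_tracks_to_3d(tracks, pmaps)[:, 0])
print(ee_path.shape)  # (30, 3) waypoints for downstream inverse kinematics
```

The design point this illustrates is that the video model only needs to supply visually plausible motion; geometric lifting and smoothing turn that motion into kinematically executable robot trajectories.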
Problem

Research questions and friction points this paper is trying to address.

robotic simulation
trajectory synthesis
manipulation data
sim-to-real transfer
open-vocabulary environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

video generation priors
automated simulation
trajectory synthesis
visual-kinematic alignment
open-vocabulary manipulation