VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

πŸ“… 2026-04-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing world models struggle to generate synthetic videos aligned with action trajectories, limiting data augmentation for robotic policy learning. This work proposes a flow-matching-based dual-stream generative framework that jointly models video and action sequences under visual-language conditioning. By integrating a synchronized denoising mechanism and adaptive 3D pooling, the approach enhances cross-modal consistency. It achieves, for the first time, high-quality end-to-end generation of video–action pairs, thereby avoiding error accumulation inherent in two-stage pipelines. Experiments demonstrate that the generated trajectories exhibit strong consistency and executability in both simulation and real-world environments, significantly improving the generalization capability of downstream policies.

πŸ“ Abstract
Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
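To make the generation scheme concrete, here is a minimal toy sketch of the two ideas the abstract describes: both streams take Euler denoising steps at a shared flow-matching timestep, and an adaptive 3D pooling of the video state supplies a compact global context to the action branch. This is an illustrative reconstruction, not the paper's implementation; the pooling output size, the latent shapes, and the stand-in velocity fields are all assumptions.

```python
import numpy as np

def adaptive_pool_3d(video, out_shape=(2, 2, 2)):
    """Average-pool a (T, H, W, C) video tensor to a fixed (t, h, w, C) grid,
    giving a compact global summary regardless of input resolution."""
    T, H, W, C = video.shape
    t, h, w = out_shape
    out = np.zeros((t, h, w, C))
    for i in range(t):
        for j in range(h):
            for k in range(w):
                ts = slice(i * T // t, (i + 1) * T // t)
                hs = slice(j * H // h, (j + 1) * H // h)
                ws = slice(k * W // w, (k + 1) * W // w)
                out[i, j, k] = video[ts, hs, ws].mean(axis=(0, 1, 2))
    return out

def synchronized_euler_step(video_x, action_x, tau, d_tau, video_vel, action_vel):
    """One Euler step of the flow-matching ODE for both streams at the SAME
    timestep tau. The action velocity is conditioned on the pooled video
    context, which is what couples the two branches."""
    ctx = adaptive_pool_3d(video_x).ravel()
    video_x = video_x + d_tau * video_vel(video_x, tau)
    action_x = action_x + d_tau * action_vel(action_x, ctx, tau)
    return video_x, action_x

# Toy demo: start both streams from noise and denoise toward zero targets.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8, 8, 3))   # noisy video latents (T, H, W, C)
action = rng.normal(size=(16, 7))       # noisy 7-DoF action chunk
video0, action0 = video.copy(), action.copy()

# Stand-ins for the learned velocity networks; a real model would be a
# conditioned transformer, and action_vel would actually use ctx.
video_vel = lambda x, tau: -x
action_vel = lambda x, ctx, tau: -x

for step in range(10):
    tau = step / 10
    video, action = synchronized_euler_step(
        video, action, tau, d_tau=0.1,
        video_vel=video_vel, action_vel=action_vel)
```

Each step shrinks both latents by a factor of 0.9 toward the target, so after ten synchronized steps the residual noise drops to roughly a third of its initial magnitude in both streams at once; in the real model, sharing the timestep like this is what keeps the partially denoised video and action states mutually consistent throughout generation.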
Problem

Research questions and friction points this paper aims to address.

video-action generation
embodied data synthesis
world models
action trajectories
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stream generation
flow matching
video-action alignment
embodied data synthesis
adaptive 3D pooling