BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

📅 2026-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three frictions in embodied world modeling: the misalignment between coordinate-space actions and pixel-space video, sensitivity to camera viewpoint, and the lack of a unified architecture across embodiments. To resolve them, the authors render pixel-aligned embodiment masks from the robot's URDF and camera parameters and inject them into a pretrained video generation model via a ControlNet-style pathway. This design aligns action control signals with predicted video, supports multi-view conditioning, and yields a single cross-embodiment architecture. A flow-based motion loss further emphasizes dynamic regions, substantially improving viewpoint robustness. Evaluated on the DROID and AgiBot-G1 datasets, the method improves video generation quality and demonstrates real-world applications in policy evaluation and goal-conditioned planning.
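The flow-based motion loss can be pictured as re-weighting the per-pixel reconstruction error by optical-flow magnitude, so dynamic regions (the moving arm, manipulated objects) dominate the static background. The weighting scheme below is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def flow_motion_loss(pred, target, flow, eps=1e-6):
    """Per-pixel squared error re-weighted by optical-flow magnitude.
    (Illustrative sketch; the paper's exact weighting may differ.)"""
    err = (pred - target) ** 2                  # (H, W) reconstruction error
    mag = np.linalg.norm(flow, axis=-1)         # (H, W) motion magnitude
    w = mag / (mag.mean() + eps)                # normalize weights to mean ~1
    return float((w * err).mean())

# Toy frame: the prediction is wrong only inside a 2x2 "moving" region,
# and the optical flow is non-zero only there.
pred = np.zeros((4, 4)); pred[:2, :2] = 1.0
target = np.zeros((4, 4))
flow = np.zeros((4, 4, 2)); flow[:2, :2, 0] = 1.0

weighted = flow_motion_loss(pred, target, flow)
plain = float(((pred - target) ** 2).mean())
```

Because all of the error sits in the moving region, the flow weighting amplifies it (here 1.0 vs. a plain MSE of 0.25); errors in static regions with zero flow would contribute nothing.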

📝 Abstract
Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .
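The ControlNet-style pathway described in the abstract can be illustrated with a minimal numeric sketch: a trainable branch encodes the rendered embodiment mask and feeds it through a zero-initialized projection, so at the start of training the frozen pretrained model's behavior is unchanged. All shapes and layer choices below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def base_block(x, w):
    """Frozen block of the pretrained video model (stand-in: one nonlinearity)."""
    return np.tanh(x @ w)

def control_branch(mask_feat, w_ctrl, w_zero):
    """Trainable ControlNet-style pathway: encodes the rendered embodiment
    mask, then projects through a zero-initialized layer, so the control
    signal does not perturb the pretrained model before training."""
    h = np.tanh(mask_feat @ w_ctrl)
    return h @ w_zero  # w_zero starts at 0 -> branch output starts at 0

d = 8
x = rng.normal(size=(4, d))          # video latent tokens
mask = rng.normal(size=(4, d))       # features of the rendered embodiment mask
w_base = rng.normal(size=(d, d))     # frozen pretrained weights
w_ctrl = rng.normal(size=(d, d))     # trainable control-encoder weights
w_zero = np.zeros((d, d))            # zero init: the key ControlNet trick

out = base_block(x + control_branch(mask, w_ctrl, w_zero), w_base)
```

With `w_zero` at zero, `out` is identical to the unconditioned `base_block(x, w_base)`; as training moves `w_zero` away from zero, the mask progressively steers the prediction without destabilizing the pretrained prior.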
Problem

Research questions and friction points this paper is trying to address.

embodied world models
video generation
action-pixel misalignment
camera viewpoint sensitivity
unified architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodiment masks
video generation
world models
ControlNet-style conditioning
flow-based motion loss
Yixiang Chen
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Peiyan Li
Ludwig-Maximilians-Universität München
data mining, graph mining

Jiabing Yang
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Keji He
SDU << CASIA & NUS
Cross-modal Learning, Embodied AI

Xiangnan Wu
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences

Yuan Xu
Associate Professor, Cumming School of Medicine, University of Calgary
Health Data Methods, Epidemiology, Health Services Research

Kai Wang
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Jing Liu
FiveAges

Nianfeng Liu
FiveAges

Yan Huang
Institute of Automation, Chinese Academy of Sciences
computer vision, deep learning, multimodal learning

Liang Wang
National Lab of Pattern Recognition
Computer Vision, Pattern Recognition, Machine Learning