MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing world models struggle to simultaneously achieve geometrically consistent multi-view 4D dynamic prediction and executable action generation, with inverse dynamics inference often being ill-posed. This work proposes an embodied 4D world model that, given only a single-view RGB-D input, generates future RGB-D sequences that are geometrically consistent across arbitrary viewpoints. Cross-view and cross-modal feature fusion ensures RGB-D consistency, while trajectory-level latent optimization at test time, combined with a residual inverse dynamics model, translates future predictions into executable actions. Experiments on three datasets demonstrate that the proposed method significantly outperforms existing approaches in both 4D scene generation and downstream manipulation tasks, validating the effectiveness of its core design components.
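The back-projection-and-fusion step the summary describes (lifting predicted RGB-D frames into a shared world frame so that views imagined from multiple cameras can be merged into one 3D structure) is a standard pinhole unprojection. The sketch below is illustrative only; the function name and conventions are assumptions, not the paper's code.

```python
# Hypothetical sketch of back-projecting a predicted RGB-D frame into the
# world frame so that frames imagined from several viewpoints can be fused
# into a single colored point cloud. Names are illustrative assumptions.
import numpy as np

def backproject_rgbd(rgb, depth, K, cam_to_world):
    """Lift an (H, W) RGB-D image into world-frame 3D points with colors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]   # pinhole model: X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]   #                Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    colors = rgb.reshape(-1, 3)
    valid = z.reshape(-1) > 0          # drop pixels without a depth estimate
    return pts_world[valid], colors[valid]
```

Fusing the imagined views then amounts to concatenating the per-view point clouds, which is how a single-view RGB-D input can still yield a more complete 3D structure over time.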

📝 Abstract
World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGB-D generation: given only a single-view RGB-D observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To learn this multi-view, cross-modality generation efficiently, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer the trajectory-level latent that best matches the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
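The abstract's test-time action inference can be caricatured as gradient descent on a trajectory-level latent through a frozen decoder, followed by a residual correction on a coarse action readout. The toy below replaces the paper's generative model with a fixed linear map `W` so the gradient of the reconstruction loss has a closed form; every name, the latent split, and the residual step are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumptions throughout) of test-time trajectory-latent
# optimization: fit a latent z so that the frozen "decoder" W reproduces the
# world model's predicted future, then refine a coarse action with a residual.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))     # stand-in frozen decoder: latent -> future frames
target = rng.normal(size=32)     # world model's predicted future (flattened)

z = np.zeros(8)                  # trajectory-level latent to optimize at test time
lr = 0.005
for _ in range(2000):            # gradient descent on ||W z - target||^2
    grad = 2.0 * W.T @ (W @ z - target)
    z -= lr * grad

coarse_action = z[:4]                  # toy inverse-dynamics readout of the latent
residual = 0.1 * np.tanh(z[4:])        # stand-in for the learned residual IDM
action = coarse_action + residual      # executable action estimate
loss = np.sum((W @ z - target) ** 2)   # remaining reconstruction error
```

In the real system the decoder is a deep generative model, so the gradient comes from backpropagation rather than a closed form, but the structure of the loop (optimize the latent against the predicted future, then decode an action with a residual correction) is the same idea the abstract describes.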
Problem

Research questions and friction points this paper is trying to address.

4D world model
robotic manipulation
scene dynamics prediction
inverse dynamics
multi-view consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D world model
view-consistent generation
test-time action inference
cross-modality fusion
trajectory-level latent optimization
Jiaxu Wang
MMLab, The Chinese University of Hong Kong
Yicheng Jiang
The Hong Kong University of Science and Technology
Tianlun He
The Hong Kong University of Science and Technology
Jingkai Sun
The University of Hong Kong
Qiang Zhang
X-Humanoid
Humanoid Robotics, Embodied AI, Robotics
Junhao He
The Hong Kong University of Science and Technology
Jiahang Cao
The University of Hong Kong
Robot Learning, Generative Models, Cognitive-inspired Models
Zesen Gan
The Hong Kong University of Science and Technology
Mingyuan Sun
Northeastern University
Robotics, Machine Learning, Neural Rendering
Qiming Shao
HKUST / UCLA / Tsinghua University
Topological spintronics, Spin-orbitronics, Magnetic insulators, Quantum devices, Efficient Learning
Xiangyu Yue
The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU
Artificial Intelligence, Computer Vision, Multi-modal Learning