🤖 AI Summary
Robot manipulation suffers from a scarcity of wrist-view video data, and existing world models cannot synthesize wrist-view videos from anchor-view inputs alone. Method: We propose the first 4D world model for cross-view video generation, combining 4D point cloud modeling with a spatiotemporally coherent generation architecture. To keep the view transformation geometrically consistent, we introduce a spatial projection consistency loss; to enhance visual fidelity, we adopt a VGGT-extended reconstruction module. Contribution/Results: Evaluated on the Droid, CALVIN, and Franka Panda datasets, our method achieves state-of-the-art video generation performance, raising the average task completion length on CALVIN by 3.81% and closing 42.4% of the anchor-to-wrist viewpoint gap. These gains improve the manipulation performance of vision-language-action (VLA) models.
📝 Abstract
Wrist-view observations are crucial for vision-language-action (VLA) models because they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap: they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Meanwhile, recent visual geometry models such as VGGT provide geometric and cross-view priors that make it possible to handle extreme viewpoint shifts. Building on these priors, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, CALVIN, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on CALVIN by 3.81% and closing 42.4% of the anchor-wrist view gap.
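The abstract does not spell out how a reconstructed point cloud is viewed "from the reconstructed perspective". As a rough illustration only, the sketch below projects 3D points into a camera with an estimated pose using a standard pinhole model. This is a generic textbook projection, not WristWorld's implementation; the function name `project_points`, its parameters, and the example intrinsics are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def project_points(
    points_world: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],   # 3x3 world-to-camera rotation (assumed given)
    translation: Sequence[float],          # 3-vector world-to-camera translation (assumed given)
    fx: float, fy: float, cx: float, cy: float,  # pinhole intrinsics (hypothetical values)
) -> List[Optional[Pixel]]:
    """Project world-frame 3D points into a camera image (pinhole model).

    Returns one entry per input point: (u, v) pixel coordinates, or None
    when the point lies on or behind the camera plane.
    """
    pixels: List[Optional[Pixel]] = []
    for p in points_world:
        # World -> camera frame: X_cam = R @ X_world + t
        x, y, z = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0.0:
            pixels.append(None)  # not visible from this viewpoint
            continue
        # Perspective divide, then apply the intrinsics.
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cloud = [(0.0, 0.0, 1.0), (0.5, -0.25, 2.0), (0.0, 0.0, -1.0)]
    print(project_points(cloud, identity, [0.0, 0.0, 0.0], 100.0, 100.0, 64.0, 48.0))
```

Applying such a projection to every frame of a 4D (time-varying) point cloud would yield a sequence of sparse wrist-view images; how WristWorld actually conditions its video generator on the reconstruction is described in the paper, not here.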