WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Robot manipulation suffers from a scarcity of wrist-view video data, and existing world models cannot synthesize wrist-view videos from anchor-view inputs alone because they require a wrist-view first frame. Method: WristWorld, described as the first 4D world model that generates wrist-view videos solely from anchor views, works in two stages. A reconstruction stage extends VGGT and adds a Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; a generation stage then synthesizes temporally coherent wrist-view videos from the reconstructed perspective. Contribution/Results: On Droid, CALVIN, and Franka Panda, the method reports state-of-the-art video generation with superior spatial consistency. It also improves the performance of vision-language-action (VLA) models, raising the average task completion length on CALVIN by 3.81% and closing 42.4% of the anchor-wrist view gap.

📝 Abstract

Wrist-view observations are crucial for vision-language-action (VLA) models because they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap: they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Meanwhile, recent visual geometry models such as VGGT provide geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.
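The abstract does not define the SPC Loss, so the following is only a minimal sketch of the general idea behind a projection-consistency objective, not the paper's formulation. It assumes a standard pinhole camera, a world-to-camera pose `(R, t)` for the wrist view, and known 2D target locations for the reconstructed 3D points. The function names `project_points` and `projection_consistency_loss` are hypothetical and chosen for illustration.

```python
import numpy as np


def project_points(points_world, R, t, K):
    """Project (N, 3) world points into a pinhole camera.

    R (3x3) and t (3,) are an assumed world-to-camera pose; K is the 3x3
    intrinsics matrix. Returns (N, 2) pixel coordinates and a boolean mask
    marking points that lie in front of the camera.
    """
    points_cam = points_world @ R.T + t           # camera-frame coordinates
    depth = points_cam[:, 2]
    in_front = depth > 1e-6                       # positive depth only
    uvw = points_cam @ K.T                        # homogeneous pixel coordinates
    safe_depth = np.where(in_front, depth, 1.0)   # avoid division by zero
    pixels = uvw[:, :2] / safe_depth[:, None]
    return pixels, in_front


def projection_consistency_loss(points_world, target_pixels, R, t, K):
    """Mean pixel distance between projected points and their 2D targets.

    Illustrative stand-in for a projection-consistency term: a wrist-view
    pose that is geometrically consistent with the point cloud yields a
    small value. Points behind the camera are ignored.
    """
    pixels, in_front = project_points(points_world, R, t, K)
    if not in_front.any():
        return 0.0
    errors = np.linalg.norm(pixels[in_front] - target_pixels[in_front], axis=1)
    return float(errors.mean())


if __name__ == "__main__":
    # Toy intrinsics and two points, all values made up for the example.
    K = np.array([[100.0, 0.0, 50.0],
                  [0.0, 100.0, 50.0],
                  [0.0, 0.0, 1.0]])
    points = np.array([[0.0, 0.0, 1.0],
                       [0.1, 0.0, 2.0]])
    targets, _ = project_points(points, np.eye(3), np.zeros(3), K)

    # The correct pose reproduces the targets; a shifted pose is penalized.
    print(projection_consistency_loss(points, targets, np.eye(3), np.zeros(3), K))
    print(projection_consistency_loss(points, targets, np.eye(3),
                                      np.array([0.1, 0.0, 0.0]), K))
```

In a training setting the same quantity would be computed with a differentiable framework so that gradients flow back to the estimated pose; NumPy is used here only to keep the sketch self-contained.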

Problem

Research questions and friction points this paper is trying to address.

Generating wrist-view videos from anchor views for robotics

Bridging the visual gap between abundant and scarce viewpoints

Enhancing manipulation performance through synthesized wrist observations

Innovation

Methods, ideas, or system contributions that make the work stand out.

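Generates wrist-view videos solely from anchor views, without requiring a wrist-view first frame

Uses a Spatial Projection Consistency (SPC) Loss to keep estimated wrist-view poses and 4D point clouds geometrically consistent

Employs a two-stage approach: reconstruction (extending VGGT), followed by video generation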

🔎 Similar Papers

Survey on Modeling of Human-made Articulated Objects