GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video world models struggle to maintain temporally consistent point-level motion, so their generated videos look plausible but lack the physical grounding needed for reliable robotic manipulation. This work proposes GEM-4D, a single-stream video world model trained with dense 4D correspondence supervision distilled from a pretrained geometry foundation model, so that it jointly models appearance and geometry. An inverse dynamics module translates the correspondence-consistent video predictions into executable robot trajectories. Because the supervision is applied only during training, geometric consistency comes at no additional inference cost, and the model can be deployed directly in both real-world and simulated manipulation. It reports state-of-the-art performance in video prediction and geometric consistency, and raises real-world manipulation success from 61% to 81%.
📝 Abstract
Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at the project page: https://anonymous-submission-20.github.io/gem.github.io/.
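The training-time supervision described in the abstract can be pictured as an auxiliary loss added to the usual video objective. The sketch below is only an illustration under stated assumptions: the function name `combined_training_loss`, the weight `corr_weight`, the choice of mean squared error for video and L1 for correspondence, and the array shapes are all hypothetical and are not taken from the paper.

```python
import numpy as np


def combined_training_loss(pred_video, target_video, pred_corr, teacher_corr,
                           corr_weight=0.5):
    """Hypothetical objective: video loss plus a correspondence distillation term.

    pred_corr is what the video backbone predicts for per-pixel 3D positions
    over time; teacher_corr stands in for the output of a frozen, pretrained
    geometry model. Both loss forms and corr_weight are assumptions.
    """
    # Appearance term: mean squared error between predicted and target frames.
    video_loss = float(np.mean((pred_video - target_video) ** 2))
    # Geometry term: L1 distance to the teacher's dense correspondences.
    corr_loss = float(np.mean(np.abs(pred_corr - teacher_corr)))
    total = video_loss + corr_weight * corr_loss
    return total, video_loss, corr_loss


# Toy data: 4 frames of 8x8 pixels, RGB values and per-pixel XYZ positions.
rng = np.random.default_rng(0)
T, H, W = 4, 8, 8
target_video = rng.random((T, H, W, 3))
pred_video = target_video + 0.1      # constant appearance error
teacher_corr = rng.random((T, H, W, 3))
pred_corr = teacher_corr + 0.2       # constant geometry error

total, video_loss, corr_loss = combined_training_loss(
    pred_video, target_video, pred_corr, teacher_corr
)
print(f"video={video_loss:.3f} corr={corr_loss:.3f} total={total:.3f}")
```

In this toy setup the teacher signal appears only inside the loss, so nothing from the geometry model would be needed once training is finished, which is one way to read the "no additional inference cost" claim.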
Problem

Research questions and friction points this paper is trying to address.

video world models
geometric consistency
robot manipulation
4D correspondence
physical grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-grounded
4D correspondence
video world model
inverse dynamics
robot manipulation