DeepVerse: 4D Autoregressive Video Generation as a World Model

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing interactive world models predict only visual observations and neglect critical latent states such as geometric structure and spatial consistency, which leads to error accumulation and temporal inconsistency. This work introduces DeepVerse, a 4D interactive world model that it presents as the first to explicitly model geometric state as a latent variable within a 4D autoregressive framework. The authors design a geometry-aware memory retrieval mechanism that combines geometric constraints with action-conditional spatiotemporal prediction, while the model implicitly learns physical dynamics. The model is reported to generate stable long-horizon sequences exceeding 100 frames across diverse scenarios, with a 32% improvement in prediction accuracy and improved visual fidelity and scene plausibility. The core innovation is the tight integration of explicit geometric modeling with 4D autoregression, which mitigates temporal drift and error propagation.

📝 Abstract

World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model that explicitly incorporates geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides a solution for geometry-aware memory retrieval that preserves long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.
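The autoregressive idea in the abstract, where each step is conditioned on an action and on earlier predictions that include geometry, can be illustrated with a minimal sketch. Everything named here (`State`, `rollout`, `toy_predict`, the `context` window, scalar actions, and lists of floats standing in for frames and depth maps) is a hypothetical simplification for illustration and is not taken from the DeepVerse implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class State:
    """One timestep: a visual observation plus an explicit geometric estimate."""
    image: List[float]  # stand-in for an RGB frame (assumption)
    depth: List[float]  # stand-in for per-pixel geometry such as depth (assumption)


# Hypothetical predictor signature: (recent states, action) -> next state.
Predictor = Callable[[Sequence[State], float], State]


def rollout(predict: Predictor, initial: State,
            actions: Sequence[float], context: int = 4) -> List[State]:
    """Generate one state per action, feeding predictions back as input."""
    history = [initial]
    for action in actions:
        # The predictor sees past images AND past geometry, not images alone.
        window = history[-context:]
        history.append(predict(window, action))
    return history


def toy_predict(window: Sequence[State], action: float) -> State:
    """Toy dynamics: moving forward brightens pixels and reduces depth."""
    last = window[-1]
    image = [p + action for p in last.image]
    depth = [d - action for d in last.depth]
    return State(image, depth)


states = rollout(toy_predict, State([0.0, 0.0], [10.0, 10.0]), [1.0, 1.0, 1.0])
print(len(states), states[-1].depth)
```

The point of the sketch is the data flow: because geometry is part of the state that is fed back, each new prediction can be constrained by the previously predicted geometry rather than by pixels alone.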

Problem

Research questions and friction points this paper is trying to address.

Error accumulation in interactive world models that predict only visual observations

Weak spatio-temporal consistency when geometric structure is not modeled

Limited long-term prediction accuracy and visual realism

Innovation

Methods, ideas, or system contributions that make the work stand out.

4D autoregressive video generation model

Incorporates geometric constraints for predictions

Enhances spatio-temporal consistency and realism
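The abstract also mentions geometry-aware memory retrieval but this page does not describe how it works. One plausible reading, shown below purely as an assumption, is to select past frames by spatial proximity rather than by recency. The names `MemoryEntry` and `retrieve`, the use of a 3D camera position as the key, and Euclidean distance as the metric are all hypothetical choices for this sketch, not the paper's method.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Position = Tuple[float, float, float]


@dataclass
class MemoryEntry:
    frame_id: int
    position: Position  # assumed camera position when the frame was generated


def retrieve(memory: List[MemoryEntry], query: Position, k: int = 2) -> List[MemoryEntry]:
    """Return the k stored frames whose positions are closest to the query."""
    ranked = sorted(memory, key=lambda e: math.dist(e.position, query))
    return ranked[:k]


memory = [
    MemoryEntry(0, (0.0, 0.0, 0.0)),
    MemoryEntry(1, (5.0, 0.0, 0.0)),
    MemoryEntry(2, (1.0, 0.0, 0.0)),
    MemoryEntry(3, (9.0, 0.0, 0.0)),
]

# A query near x = 0.8 should prefer frame 2, then frame 0,
# even though frame 3 is the most recent entry.
nearest = retrieve(memory, (0.8, 0.0, 0.0), k=2)
print([e.frame_id for e in nearest])
```

Retrieving by position instead of by time is what lets a revisited location reuse what was generated there earlier, which is one way long-term spatial consistency could be preserved.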

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency