Aether: Geometric-Aware Unified World Modeling

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Geometry-aware, human-like spatial reasoning remains a fundamental challenge in AI. This paper introduces Aether, a unified 4D world model that jointly optimizes dynamic geometric reconstruction, action-conditioned video prediction, and goal-conditioned visual planning. The method uses task-interleaved feature learning to share knowledge across these three objectives, and it treats camera trajectories as a geometry-informed action space. Although trained without real-world data, the model is reported to reach reconstruction accuracy comparable to or better than domain-specific models, to transfer from synthetic to real data, to generalize zero-shot in both action following and reconstruction, and to improve the physical plausibility of visual planning and predicted dynamics.

📝 Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
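The abstract names three tasks but does not say how they are interleaved during training. The following is a minimal sketch of one possible reading, in which each training step draws one task and the task determines which conditioning inputs are supplied. The condition names (`full_video`, `first_frame`, `goal_frame`, `camera_actions`), the `TASK_CONDITIONS` table, and the uniform sampling rule are all assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical mapping from task to the conditioning inputs it receives.
# Only the three task names come from the abstract; the inputs are assumed.
TASK_CONDITIONS: Dict[str, Tuple[str, ...]] = {
    "reconstruction": ("full_video",),                # observe the video, recover geometry
    "prediction": ("first_frame", "camera_actions"),  # action-conditioned video prediction
    "planning": ("first_frame", "goal_frame"),        # goal-conditioned visual planning
}


def interleaved_schedule(num_steps: int, seed: int = 0) -> List[str]:
    """Draw one task per training step, uniformly at random (assumed rule)."""
    rng = random.Random(seed)
    tasks = sorted(TASK_CONDITIONS)
    return [rng.choice(tasks) for _ in range(num_steps)]


def batch_spec(task: str) -> Dict[str, object]:
    """Describe which conditioning inputs a batch for this task would carry."""
    return {"task": task, "conditions": TASK_CONDITIONS[task]}


if __name__ == "__main__":
    for step, task in enumerate(interleaved_schedule(6, seed=42)):
        print(step, batch_spec(task))
```

Because every step goes through the same shared model, mixing tasks in this way is one route by which reconstruction, prediction, and planning could share features; the actual schedule used by Aether may differ.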
Problem

Research questions and friction points this paper is trying to address.

Integrating geometric reconstruction and generative modeling for spatial reasoning
Enabling geometry-aware reasoning through dynamic reconstruction and video prediction
Achieving zero-shot generalization in action-following and reconstruction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly optimizes 4D reconstruction, action-conditioned video prediction, and goal-conditioned visual planning
Task-interleaved feature learning for knowledge sharing across tasks
Geometry-informed action space for planning
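To make the idea of a camera trajectory acting as an action space concrete, here is a small sketch that turns a sequence of camera-to-world poses into per-step relative motions. This summary does not say how Aether encodes camera trajectories, so the relative-pose representation and the helper names (`invert_rigid`, `trajectory_to_actions`, `translation`) are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 row-major homogeneous rigid transform


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def trajectory_to_actions(cam_to_world: List[Matrix]) -> List[Matrix]:
    """Relative motion between consecutive frames: inv(T_t) @ T_{t+1}."""
    return [matmul(invert_rigid(a), b)
            for a, b in zip(cam_to_world, cam_to_world[1:])]


def translation(x: float, y: float, z: float) -> Matrix:
    """Pure-translation pose, used here only to build a toy trajectory."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


if __name__ == "__main__":
    # Toy trajectory: move 1 unit forward along z, then 0.5 units along x.
    poses = [translation(0, 0, 0), translation(0, 0, 1), translation(0.5, 0, 1)]
    for step, action in enumerate(trajectory_to_actions(poses)):
        print(step, [row[3] for row in action[:3]])
```

Expressing each action in the previous camera's frame makes the representation independent of where the world origin is placed, which is one reason relative poses are a common choice for ego-motion.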