Captain Safari: A World Engine

📅 2025-11-27
🤖 AI Summary
Existing methods struggle to generate long, 3D-consistent videos with interactive camera control in complex outdoor scenes: they suffer from geometric instability, trajectory deviation, and overly conservative motion. Captain Safari addresses this with a pose-conditioned world memory: a persistent world memory is combined with a dynamic local memory and a retriever that fetches pose-aligned world tokens, which condition video generation along the requested camera path. This lets the model keep 3D structure stable while executing aggressive 6-DoF camera trajectories. For evaluation, the authors curate OpenSafari, a new in-the-wild FPV drone video dataset with verified camera trajectories. Against state-of-the-art camera-controlled generators, the method reduces MEt3R from 0.3703 to 0.3690 (3D consistency), improves AUC@30 from 0.181 to 0.200 (trajectory following), and yields lower FVD than all camera-controlled baselines (video quality). In a 50-participant, 5-way human study, 67.6% of preferences favor the method.

📝 Abstract
World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
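To put the reported numbers in proportion, the short calculation below derives relative changes from the figures quoted in the abstract. Only the four metric values, the 67.6% preference share, and the 5-way format come from the source; the variable names and the comparison against a uniform 20% share are illustrative assumptions, not part of the paper's analysis.

```python
# Values quoted in the abstract (baseline -> Captain Safari).
met3r_baseline, met3r_ours = 0.3703, 0.3690   # reported as reduced
auc30_baseline, auc30_ours = 0.181, 0.200     # reported as improved
preference_share = 0.676                      # share of votes in the 5-way study
num_models = 5

# Relative change with respect to the baseline, in percent.
met3r_reduction_pct = (met3r_baseline - met3r_ours) / met3r_baseline * 100
auc30_gain_pct = (auc30_ours - auc30_baseline) / auc30_baseline * 100

# Assumption: if votes were spread evenly, each of the five models would get 1/5.
uniform_share = 1 / num_models
preference_ratio = preference_share / uniform_share

print(f"MEt3R relative reduction: {met3r_reduction_pct:.2f}%")
print(f"AUC@30 relative gain:     {auc30_gain_pct:.1f}%")
print(f"Preference vs. uniform:   {preference_ratio:.2f}x")
```

Read this way, the MEt3R change is well under one percent in relative terms, while the AUC@30 gain is roughly a tenth and the preference share is several times the uniform share.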
Problem

Research questions and friction points this paper is trying to address.

Generating long, 3D-consistent videos from user-controlled camera paths
Maintaining geometric coherence during aggressive 6-DoF camera maneuvers
Handling complex outdoor scenes during interactive exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-conditioned world memory for video generation
Dynamic local memory with pose-aligned token retrieval
Generates 3D-consistent videos along user-controlled camera paths
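
The idea of pose-aligned token retrieval listed above can be illustrated with a minimal sketch. This is not the paper's retriever, whose design is not described on this page: it swaps in a plain nearest-neighbor lookup over stored camera poses. The `MemoryEntry` class, the `pose_distance` metric, the `angle_weight` parameter, and the string tokens are all hypothetical stand-ins.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    position: tuple  # camera position (x, y, z)
    forward: tuple   # unit viewing direction
    token: str       # hypothetical stand-in for a stored world token


def pose_distance(pos_a, fwd_a, pos_b, fwd_b, angle_weight=1.0):
    """Hypothetical pose metric: Euclidean distance plus weighted view angle in radians."""
    translation = math.dist(pos_a, pos_b)
    cos_angle = sum(a * b for a, b in zip(fwd_a, fwd_b))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding error
    return translation + angle_weight * angle


def retrieve(memory, position, forward, k=2):
    """Return the tokens of the k entries whose stored poses are closest to the query pose."""
    ranked = sorted(
        memory,
        key=lambda e: pose_distance(position, forward, e.position, e.forward),
    )
    return [entry.token for entry in ranked[:k]]


# Toy memory bank: four stored views of a scene.
memory = [
    MemoryEntry((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), "start"),
    MemoryEntry((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), "right"),
    MemoryEntry((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), "behind"),
    MemoryEntry((5.0, 0.0, 0.0), (0.0, 0.0, 1.0), "far"),
]

# Query pose slightly to the right of the start, looking forward.
nearest = retrieve(memory, (0.2, 0.0, 0.0), (0.0, 0.0, 1.0), k=2)
print(nearest)
```

In this toy setup the query returns the two forward-facing views nearest to the camera, and the view facing the opposite direction is ranked lower even though it shares a position with the closest entry. A learned retriever would presumably replace the hand-written distance with a trained similarity, but the conditioning pattern (query by pose, fetch matching tokens, pass them to the generator) is the same.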