AI Summary
In generative world models, the lack of precise camera-pose controllability hinders physically consistent novel-view synthesis and dynamic scene modeling. To address this, we propose a pose-driven bidirectional photometric deformation mechanism coupled with a backward pose regression loss, enabling, for the first time, high-fidelity and editable camera motion control without ground-truth pose annotations. Our method integrates self-supervised depth and pose estimation, structured optical flow modeling, photometric consistency constraints, and backward frame warping, and is compatible with both diffusion-based and autoregressive architectures. Evaluated on autonomous driving and general video datasets, our approach reduces pose control error by 37% compared to prior methods, achieves state-of-the-art geometric consistency in generated frames, and significantly enhances structural understanding and motion reasoning.
Abstract
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and autoregressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
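The core geometric operation the abstract describes, warping one frame into another view using predicted depth and relative camera pose, then scoring the result with a photometric consistency loss, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the pinhole intrinsics `K`, rotation `R`, and translation `t` are assumed inputs, and nearest-neighbour sampling stands in for the differentiable bilinear sampling a trainable model would use.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel of a depth map to a 3-D point in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                   # unit-depth rays
    return rays * depth.reshape(1, -1)                              # scale by depth

def warp_frame(src, depth_tgt, K, R, t):
    """Inverse-warp the source frame into the target view.

    Target-view points are moved into the source camera with the relative
    pose (R, t), projected with K, and the source image is sampled there
    (nearest neighbour here, for simplicity).
    """
    H, W = depth_tgt.shape
    pts = R @ backproject(depth_tgt, K) + t.reshape(3, 1)   # target -> source frame
    proj = K @ pts
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)           # perspective divide
    u = np.clip(np.round(uv[0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[1]).astype(int), 0, H - 1)
    return src[v, u].reshape(H, W)

def photometric_loss(tgt, src, depth_tgt, K, R, t):
    """Mean absolute intensity difference between target and warped source."""
    return np.abs(tgt - warp_frame(src, depth_tgt, K, R, t)).mean()
```

With an identity pose the warp maps every pixel back onto itself, so the loss is zero; for a genuine camera motion, minimizing this loss jointly constrains the predicted depth and pose, which is the coupling PosePilot exploits.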