AI Summary
In generative world models, the lack of precise camera-pose controllability hinders physically consistent novel-view synthesis and dynamic scene modeling. To address this, we propose a pose-driven bidirectional photometric deformation mechanism coupled with a backward pose regression loss, enabling, for the first time, high-fidelity and editable camera motion control without ground-truth pose annotations. Our method integrates self-supervised depth and pose estimation, structured optical flow modeling, photometric consistency constraints, and backward frame warping, and is compatible with both diffusion-based and autoregressive architectures. Evaluated on autonomous driving and general video datasets, our approach reduces pose control error by 37% compared to prior methods, achieves state-of-the-art geometric consistency in generated frames, and significantly enhances structural understanding and motion reasoning.
Abstract
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and autoregressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
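The core geometric operation the abstract describes, warping one frame into another view using predicted depth and relative camera pose, then scoring the result with a photometric consistency loss, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the pinhole intrinsics `K`, rotation `R`, and translation `t` are assumed inputs, and nearest-neighbour sampling stands in for the differentiable bilinear sampling a trainable model would use.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel of a depth map to a 3-D point in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                   # unit-depth rays
    return rays * depth.reshape(1, -1)                              # scale by depth

def warp_frame(src, depth_tgt, K, R, t):
    """Inverse-warp the source frame into the target view.

    Target-view points are moved into the source camera with the relative
    pose (R, t), projected with K, and the source image is sampled there
    (nearest neighbour here, for simplicity).
    """
    H, W = depth_tgt.shape
    pts = R @ backproject(depth_tgt, K) + t.reshape(3, 1)   # target -> source frame
    proj = K @ pts
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)           # perspective divide
    u = np.clip(np.round(uv[0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[1]).astype(int), 0, H - 1)
    return src[v, u].reshape(H, W)

def photometric_loss(tgt, src, depth_tgt, K, R, t):
    """Mean absolute intensity difference between target and warped source."""
    return np.abs(tgt - warp_frame(src, depth_tgt, K, R, t)).mean()
```

With an identity pose the warp maps every pixel back onto itself, so the loss is zero; for a genuine camera motion, minimizing this loss jointly constrains the predicted depth and pose, which is the coupling PosePilot exploits.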