PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

๐Ÿ“… 2025-05-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
In generative world models, insufficient precise controllability over camera poses hinders physically consistent novel-view synthesis and dynamic scene modeling. To address this, we propose a pose-driven bidirectional photometric deformation mechanism coupled with a backward pose regression lossโ€”enabling high-fidelity, editable camera motion control without requiring ground-truth pose annotations for the first time. Our method integrates self-supervised depth and pose estimation, structured optical flow modeling, photometric consistency constraints, and backward frame warping, and is compatible with both diffusion- and autoregressive-based architectures. Evaluated on autonomous driving and general video datasets, our approach reduces pose control error by 37% compared to prior methods, achieves state-of-the-art geometric consistency in generated frames, and significantly enhances structural understanding and motion reasoning capabilities.

Technology Category

Application Category

๐Ÿ“ Abstract
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing camera pose controllability in generative world models
Improving viewpoint precision with self-supervised depth estimation
Achieving physically consistent viewpoint synthesis in simulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages self-supervised depth for pose control
Uses photometric warping loss for consistency
Introduces reverse warping for precise estimation
๐Ÿ”Ž Similar Papers
No similar papers found.
Bu Jin
Bu Jin
HKUST
3D generationAutonomous DrivingVision-Language Model
W
Weize Li
Institute for AI Industry Research (AIR), Tsinghua University
B
Baihan Yang
Institute for AI Industry Research (AIR), Tsinghua University; School of Computer Science and Technology, Beijing Jiaotong University
Zhenxin Zhu
Zhenxin Zhu
Xiaomi AD
AIGCNeRF
J
Junpeng Jiang
Li Auto
Huan-ang Gao
Huan-ang Gao
Ph.D. student, Tsinghua University
AgentVision & Robotics
H
Haiyang Sun
Li Auto
K
Kun Zhan
Li Auto
H
Hengtong Hu
Li Auto
Xueyang Zhang
Xueyang Zhang
Li Auto Inc.
Autonomous DrivingWorld Model3D Vision
P
Peng Jia
Li Auto
H
Hao Zhao
Institute for AI Industry Research (AIR), Tsinghua University