🤖 AI Summary
Existing video diffusion models struggle to generate object motions with accurate 6D pose estimation and 3D consistency under large-angle rotations, primarily due to the absence of geometric constraints in mapping 2D trajectories to 3D poses. To address this, we propose a two-stage pose-aware pretraining framework: (1) constructing PoseTraj-10K, a large-scale synthetic dataset supervised by 3D bounding boxes as intermediate geometric signals; and (2) introducing a camera-decoupled module that explicitly models camera parameters to enforce motion geometric consistency. Our approach is the first to deeply integrate 6D pose estimation, 3D geometric priors, and video diffusion modeling. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks: +12.7% pose alignment under rotational trajectories, +9.4% trajectory control accuracy, and improved video quality (FVD reduced by 18.3%).
📝 Abstract
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.