🤖 AI Summary
Existing video generation models often violate physical laws, resulting in unrealistic motion. To address this, we propose TrajVLM-Gen, a two-stage framework for physically consistent image-to-video generation. In the first stage, a vision-language model (VLM) predicts coarse-grained motion trajectories grounded in real-world dynamics, jointly leveraging semantic understanding and physical priors. In the second stage, a trajectory-guided attention mechanism uses these trajectories to steer fine-grained video synthesis. To support this paradigm, we construct a trajectory prediction dataset from video tracking data, explicitly designed for physical plausibility. Evaluated on UCF-101 and MSR-VTT, TrajVLM-Gen achieves Fréchet Video Distance (FVD) scores of 545 and 539, respectively (lower is better), outperforming existing methods. Qualitative and quantitative analyses confirm improvements in both the physical consistency and the visual fidelity of generated videos.
📝 Abstract
Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
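The paper page does not include code, but the core idea of the second stage, attention that is biased toward the predicted trajectory, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name `trajectory_guided_attention`, the `traj_mask` rasterization, and the additive-bias formulation are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def trajectory_guided_attention(q, k, v, traj_mask, bias_scale=2.0):
    """Hypothetical sketch of trajectory-guided attention.

    q, k, v:   (batch, heads, tokens, dim) spatial attention inputs
    traj_mask: (batch, tokens) in [0, 1]; 1 marks patches that the
               coarse VLM-predicted trajectory passes through
               (assumed to be rasterized onto the patch grid upstream)
    """
    d = q.size(-1)
    # Standard scaled dot-product attention scores: (B, H, T, T)
    scores = q @ k.transpose(-2, -1) / d**0.5
    # Add a positive bias to keys lying on the trajectory so that
    # motion refinement attends to physically plausible regions.
    bias = bias_scale * traj_mask[:, None, None, :]  # broadcast over heads/queries
    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v

# Toy usage on random tensors
B, H, T, D = 1, 4, 16, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
mask = torch.zeros(B, T)
mask[:, :4] = 1.0  # pretend the first 4 patches lie on the predicted path
out = trajectory_guided_attention(q, k, v, mask)
print(out.shape)  # torch.Size([1, 4, 16, 32])
```

An additive bias before the softmax is only one plausible reading of "trajectories guide video generation through attention-based mechanisms"; the paper's actual conditioning could equally be multiplicative gating or cross-attention over trajectory tokens.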