From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation models often violate physical laws, resulting in unrealistic motion. To address this, we propose TrajVLM-Gen, a two-stage physically consistent image-to-video generation framework. In the first stage, we introduce a visual-language model (VLM) to predict coarse-grained motion trajectories grounded in real-world dynamics, jointly leveraging semantic understanding and prior physical knowledge—a novel approach. In the second stage, a trajectory-guided attention mechanism enables fine-grained video synthesis. To support this paradigm, we construct the first trajectory prediction dataset explicitly designed for physical plausibility. Evaluated on UCF-101 and MSR-VTT, TrajVLM-Gen achieves Fréchet Video Distance (FVD) scores of 545 and 539, respectively—substantially outperforming state-of-the-art methods. Qualitative and quantitative analyses confirm significant improvements in both physical consistency and visual fidelity of generated videos.

Technology Category

Application Category

📝 Abstract
Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
Problem

Research questions and friction points this paper is trying to address.

Addressing physically inconsistent motion in video generation
Predicting coarse-grained trajectories with real-world physics
Guiding video generation through attention-based motion refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model predicts physics-consistent motion trajectories
Attention mechanisms guide fine-grained video generation
Trajectory dataset built from realistic video tracking data
🔎 Similar Papers
No similar papers found.
F
Fan Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China
Z
Zhiyang Chen
MAPLE Lab, Westlake University
Yousong Zhu
Yousong Zhu
Associate Professor, Chinese Academy of Sciences, Institute of Automation
Multimodal Large Language ModelsSelf-supervised LearningObject Detection
X
Xin Li
Peng Cheng Laboratory, Shenzhen, China
J
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China; Wuhan AI Research, Wuhan, China