🤖 AI Summary
Existing video generation models often violate physical laws, resulting in unrealistic motion. To address this, we propose TrajVLM-Gen, a two-stage framework for physically consistent image-to-video generation. In the first stage, a vision-language model (VLM) predicts coarse-grained motion trajectories grounded in real-world dynamics, jointly leveraging semantic understanding and physical priors. In the second stage, a trajectory-guided attention mechanism uses these trajectories to steer fine-grained video synthesis. To support this paradigm, we construct a trajectory prediction dataset from video tracking data, explicitly designed for physical plausibility. Evaluated on UCF-101 and MSR-VTT, TrajVLM-Gen achieves Fréchet Video Distance (FVD) scores of 545 and 539, respectively (lower is better), outperforming existing methods. Qualitative and quantitative analyses confirm improvements in both the physical consistency and the visual fidelity of generated videos.
📝 Abstract
Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
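The paper page does not include code, but the core idea of the second stage, attention that is biased toward the predicted trajectory, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name `trajectory_guided_attention`, the `traj_mask` rasterization, and the additive-bias formulation are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def trajectory_guided_attention(q, k, v, traj_mask, bias_scale=2.0):
    """Hypothetical sketch of trajectory-guided attention.

    q, k, v:   (batch, heads, tokens, dim) spatial attention inputs
    traj_mask: (batch, tokens) in [0, 1]; 1 marks patches that the
               coarse VLM-predicted trajectory passes through
               (assumed to be rasterized onto the patch grid upstream)
    """
    d = q.size(-1)
    # Standard scaled dot-product attention scores: (B, H, T, T)
    scores = q @ k.transpose(-2, -1) / d**0.5
    # Add a positive bias to keys lying on the trajectory so that
    # motion refinement attends to physically plausible regions.
    bias = bias_scale * traj_mask[:, None, None, :]  # broadcast over heads/queries
    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v

# Toy usage on random tensors
B, H, T, D = 1, 4, 16, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
mask = torch.zeros(B, T)
mask[:, :4] = 1.0  # pretend the first 4 patches lie on the predicted path
out = trajectory_guided_attention(q, k, v, mask)
print(out.shape)  # torch.Size([1, 4, 16, 32])
```

An additive bias before the softmax is only one plausible reading of "trajectories guide video generation through attention-based mechanisms"; the paper's actual conditioning could equally be multiplicative gating or cross-attention over trajectory tokens.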