Direct Motion Models for Assessing Generated Videos

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation evaluation metrics (e.g., FVD) struggle to detect motion distortions and physically implausible object interactions. To address this, the paper proposes a motion-centric evaluation paradigm grounded in point trajectories, treating individual point tracks as the fundamental units of assessment. The method comprises trajectory extraction, autoencoder-based motion modeling, and motion feature embedding, enabling fine-grained per-video assessment, cross-video distribution comparison, and spatiotemporal error localization. It is markedly more sensitive to temporal inconsistencies than prior metrics and aligns better with subjective quality ratings. Evaluated across multiple state-of-the-art video generation models, the approach outperforms a wide range of baseline metrics, including FVD. In predicting human judgments of temporal realism, it improves Spearman correlation over prior methods and supports interpretable visualization of motion anomaly regions.

📝 Abstract
A current limitation of generative video models is that they generate plausible-looking frames, but poor motion -- an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: http://trajan-paper.github.io.
Problem

Research questions and friction points this paper is trying to address.

Detecting implausible motion in generated videos, which FVD and similar metrics miss
Measuring plausible object interactions and motion
Localizing generative video inconsistencies spatiotemporally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-encoding point tracks for motion features
Motion features usable for both distribution-level comparison and single-video evaluation
Spatiotemporally localizing video inconsistencies
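The localization idea above can be sketched with the same stand-in assumptions: instead of the paper's learned motion model, fit a linear reconstruction model on reference tracks and flag test tracks whose reconstruction error is anomalously high, reporting the frame where each flagged track's error peaks. The function name `localize_anomalies` and the z-score threshold are illustrative choices, not from the paper.

```python
import numpy as np

def localize_anomalies(ref_tracks, test_tracks, dim=4, z=3.0):
    """Spatiotemporally flag anomalous point tracks.

    Fits a linear (PCA) motion model on ref_tracks, reconstructs
    test_tracks with it, and flags tracks whose mean reconstruction
    error exceeds mean + z * std. Both inputs have shape
    (num_tracks, num_frames, 2).

    Returns {track_index: frame_index_of_peak_error} for flagged tracks.
    """
    n, t, _ = ref_tracks.shape
    flat_ref = ref_tracks.reshape(n, -1)
    mean = flat_ref.mean(axis=0)
    _, _, vt = np.linalg.svd(flat_ref - mean, full_matrices=False)
    comps = vt[:dim]

    flat = test_tracks.reshape(len(test_tracks), -1) - mean
    recon = (flat @ comps.T) @ comps                  # project and decode
    # per-frame squared error, summed over (x, y)
    err = ((flat - recon) ** 2).reshape(len(test_tracks), t, 2).sum(axis=-1)

    track_err = err.mean(axis=1)
    thresh = track_err.mean() + z * track_err.std()
    bad = np.where(track_err > thresh)[0]
    return {int(i): int(err[i].argmax()) for i in bad}
```

Because the unit of assessment is an individual point track, a flagged (track, frame) pair directly names *where* and *when* the generated motion breaks, which is the interpretability benefit the abstract claims over frame- or clip-level metrics.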