Tempered Self-Similarity Alignment for Physically Plausible Video Generation

📅 2026-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models often produce physically implausible videos, exhibiting appearance drift, motion distortion, and temporal inconsistency. To address this, this work proposes a Tempered Self-similarity Alignment (TSA) loss that transfers spatio-temporal self-similarity (STSS) knowledge from visual foundation models into video generative models. Specifically, STSS is converted into probabilistic correspondence distributions, and TSA trains the generative model to match the foundation model's distributions on dynamically changing regions, so that it better captures realistic physical motion. Evaluated on the VideoPhy and VideoPhy2 benchmarks, the method substantially improves the physical plausibility of generated videos across diverse interaction scenarios, supporting the effectiveness of STSS-based knowledge transfer for video generation.
📝 Abstract
Despite remarkable advances, video generative models still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time. It reveals the relational structure of how objects interact with other entities throughout a video and thereby captures real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose the Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on the VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
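To make the abstract's description concrete, the following is a minimal NumPy sketch of one way such an alignment loss could look. It is not the authors' implementation. The abstract states only that STSS is turned into probabilistic correspondence distributions and aligned on dynamically changing regions; the cosine similarity, the softmax temperature `tau`, the KL divergence, the binary `dynamic_mask`, and all function names below are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stss_distribution(feats, tau):
    """Turn token features into row-wise correspondence distributions.

    feats: (N, D) array, N tokens flattened over frames and spatial positions.
    tau:   softmax temperature (assumed form of "tempering").
    """
    # Assumption: similarity is cosine similarity between token features.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    sim = f @ f.T  # (N, N) spatio-temporal self-similarity
    return softmax(sim / tau, axis=-1)


def tsa_loss_sketch(gen_feats, vfm_feats, dynamic_mask, tau=0.1):
    """Hypothetical alignment loss: masked KL(teacher || student).

    gen_feats:    (N, D_gen) features from the video generative model.
    vfm_feats:    (N, D_vfm) features from the visual foundation model.
    dynamic_mask: (N,) array, 1 for tokens in dynamic regions, else 0.
    """
    p_teacher = stss_distribution(vfm_feats, tau)
    p_student = stss_distribution(gen_feats, tau)
    # Per-token KL divergence between correspondence distributions.
    kl = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1,
    )
    w = dynamic_mask.astype(kl.dtype)
    # Average over dynamic tokens only; guard against an empty mask.
    return float((kl * w).sum() / max(w.sum(), 1.0))
```

Because each self-similarity matrix is N x N regardless of feature width, the two models in this sketch only need matching token counts, not matching feature dimensions. How the dynamic regions are actually detected is not stated in the abstract, so the mask is simply taken as an input here.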
Problem

Research questions and friction points this paper is trying to address.

physically plausible video generation
appearance drift
implausible motion
temporal inconsistencies
spatio-temporal self-similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-similarity alignment
spatio-temporal relational knowledge
physically plausible video generation
visual foundation models
correspondence distribution