SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video diffusion models often fall short of precise semantic alignment: they drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. To address this, the work proposes Semantically Adaptive Relational Alignment (SARA), which keeps token-relation distillation (TRD) on a frozen visual foundation model as the target and adds a text-conditioned continuous saliency that decides which token pairs carry supervision. A pair-routing operator weights a token pair whenever either of its endpoints is salient, routing supervision toward subject-subject and subject-background pairs and away from background-background ones. The saliency comes from a lightweight Stage 1 aligner trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser. In the Wan2.2 continual-training setting, the method is reported to improve both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, the public VBench benchmarks, and a blind user study.

📝 Abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
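
The pair-routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that a pair receives weight whenever either endpoint is salient, so the soft-OR form `s_i + s_j - s_i * s_j`, the cosine-similarity relation matrix, the absolute-difference loss, and all function names below are assumptions chosen for illustration.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between token feature vectors."""
    # Zero-norm vectors get a norm of 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def pair_routing_weights(saliency):
    """Assumed soft-OR routing: a pair is weighted if either endpoint is salient.

    With saliency in [0, 1], w_ij = s_i + s_j - s_i * s_j is near 1 for
    subject-subject and subject-background pairs and near 0 for
    background-background pairs.
    """
    return [[si + sj - si * sj for sj in saliency] for si in saliency]


def routed_trd_loss(student_tokens, teacher_tokens, saliency):
    """Weighted mean absolute gap between student and teacher relation matrices."""
    rel_student = cosine_similarity_matrix(student_tokens)
    rel_teacher = cosine_similarity_matrix(teacher_tokens)
    weights = pair_routing_weights(saliency)
    n = len(saliency)
    num = sum(
        weights[i][j] * abs(rel_student[i][j] - rel_teacher[i][j])
        for i in range(n) for j in range(n)
    )
    den = sum(weights[i][j] for i in range(n) for j in range(n))
    return num / den if den > 0 else 0.0
```

In this sketch, a mismatch between two tokens that both have zero saliency contributes nothing to the loss, while the same mismatch is penalised as soon as one of the two tokens is marked salient. That is the behaviour the abstract describes as routing supervision away from background-background pairs.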

Problem

Research questions and friction points this paper is trying to address.

video diffusion models

text-to-video alignment

token-relation distillation

semantic relevance

prompt fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically Adaptive Relational Alignment

Token-Relation Distillation

Text-Conditioned Saliency