🤖 AI Summary
Existing video diffusion models often drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. To address these limitations, this work proposes SARA (Semantically Adaptive Relational Alignment), which keeps token-relation distillation (TRD) on a frozen visual foundation model target and adds a text-conditioned continuous saliency that decides which token pairs receive supervision. A pair-routing operator weights each token pair whenever either of its endpoints is salient, steering supervision toward subject-subject and subject-background pairs and away from background-background ones. The saliency comes from a lightweight Stage 1 aligner trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, the public VBench benchmarks, and a blind user study.
📝 Abstract
Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
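The pair-routing idea described above can be illustrated with a small sketch. The abstract does not give the exact operator or loss, so everything here is an assumption made for illustration: the soft-OR weight `w_ij = s_i + s_j - s_i * s_j`, the cosine-similarity relation matrix, the L1 distance between student and teacher relations, and all function names are hypothetical and not taken from the paper.

```python
import math


def cosine_relations(tokens):
    """Pairwise cosine similarity between token feature vectors."""
    # Fall back to 1.0 for zero vectors to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    n = len(tokens)
    return [
        [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def pair_routing_weights(saliency):
    """Assumed soft OR: a pair is weighted when either endpoint is salient."""
    return [[si + sj - si * sj for sj in saliency] for si in saliency]


def weighted_trd_loss(student_tokens, teacher_tokens, saliency):
    """Saliency-weighted L1 gap between student and teacher relation matrices."""
    rel_s = cosine_relations(student_tokens)
    rel_t = cosine_relations(teacher_tokens)
    w = pair_routing_weights(saliency)
    n = len(saliency)
    num = sum(
        w[i][j] * abs(rel_s[i][j] - rel_t[i][j]) for i in range(n) for j in range(n)
    )
    den = sum(w[i][j] for i in range(n) for j in range(n))
    return num / den if den > 0 else 0.0


# Toy example: token 0 is a salient subject, tokens 1 and 2 are background.
saliency = [1.0, 0.0, 0.0]
weights = pair_routing_weights(saliency)
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = weighted_trd_loss(student, teacher, saliency)
```

With this assumed form, any pair that touches the subject token keeps full weight, while the background-background pair between tokens 1 and 2 gets zero weight and contributes nothing to the loss. A hard `max(s_i, s_j)` would behave the same way on binary saliency; the soft OR is chosen here only because it stays smooth for continuous saliency values.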