🤖 AI Summary
To address semantic drift and temporal inconsistency in latent video diffusion models (LVDMs) trained on noisy, weakly annotated web-scale video-text data, this paper proposes CAT-LVDM, the first corruption-aware training framework for LVDMs. The method introduces two core innovations: (1) Batch-Centered Noise Injection (BCNI), which perturbs conditioning embeddings along intra-batch semantic directions; and (2) Spectrum-Aware Contextual Noise (SACN), which injects noise along the dominant directions of the embedding spectrum. The authors theoretically prove that such low-rank, data-aligned perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. Experiments show that BCNI reduces Fréchet Video Distance (FVD) by 31.9% on average across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% FVD improvement on UCF-101. Together, they significantly enhance the semantic fidelity and temporal coherence of generated videos.
📝 Abstract
Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency; BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% FVD improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm
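To make the two noise-injection schemes concrete, here is a minimal NumPy sketch of the core ideas as the abstract describes them: BCNI perturbs each embedding along its intra-batch semantic direction (its offset from the batch mean), while SACN confines the injected noise to the dominant spectral directions of the batch. Function names, the scale parameter `sigma`, and the rank cutoff `k` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bcni(embeddings, sigma=0.1):
    """Batch-Centered Noise Injection (illustrative sketch).

    Perturbs each embedding along its offset from the batch mean,
    i.e. along intra-batch semantic directions."""
    center = embeddings.mean(axis=0, keepdims=True)
    directions = embeddings - center              # intra-batch semantic directions
    scale = sigma * rng.standard_normal((embeddings.shape[0], 1))
    return embeddings + scale * directions

def sacn(embeddings, k=4, sigma=0.1):
    """Spectrum-Aware Contextual Noise (illustrative sketch).

    Injects noise only along the top-k principal directions of the
    centered batch embedding matrix, keeping the perturbation low-rank."""
    center = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - center
    # SVD of the centered batch; rows of Vt are principal directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    coeffs = sigma * rng.standard_normal((embeddings.shape[0], k))
    return embeddings + coeffs @ Vt[:k]

# e.g. a batch of 8 caption embeddings of dimension 16
batch = rng.standard_normal((8, 16))
noisy_bcni = bcni(batch)   # same shape, perturbed toward/away from batch mean
noisy_sacn = sacn(batch)   # same shape, perturbation confined to a rank-4 subspace
```

Both perturbations are low-rank and aligned with the data, which is the structural property the paper's theoretical analysis ties to tighter entropy, Wasserstein, and score-drift bounds.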